
Advanced Graphics and Data Visualization in R¶

Lecture 05: Dimension Transformation¶

0.1.0 An overview of Advanced Graphics and Data Visualization in R¶

"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This course (CSB1021) was developed to enhance the skills of students with a basic background in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.

This lesson is the fifth in a 6-part series. By the end of this series, students should be able to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.


0.2.0 Lecture objectives¶

This week focuses on exploring the relationships within your data: comparing observations between experimental samples and applying a suite of clustering, dimension-reduction, and projection algorithms.

At the end of this lecture you will have covered visualizations related to:

  1. Clustering with heatmaps and dendrograms
  2. K-means clustering
  3. Principal component analysis
  4. Non-linear projections

0.3.0 A legend for text format in Jupyter markdown¶

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn R
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.

0.4.0 Lecture and data files used in this course¶

0.4.1 Weekly Lecture and skeleton files¶

Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2023-03-Adv_Graphics_R/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.

Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.

0.4.2 Live-coding HTML page¶

A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!

0.4.3 Post-lecture PDFs¶

As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus.

0.4.4 Data used in this lesson¶

Today's datasets will focus on the smaller datasets we worked on in earlier lectures (namely our Ontario public health unit COVID-19 demographics data), and a new set of RNA-Seq analyses of different tissue samples from COVID-19 patients.

0.4.4.1 Dataset 1: phu_demographics_census_norm.csv¶

This is a combination of datasets from previous lectures. This incorporates PHU demographic data with StatsCan census data from 2017 to produce a normalized estimate of cases, deaths, and hospitalizations across age groups and public health units in Ontario.

0.4.4.2 Dataset 2: Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv¶

This is the same readcount data we looked at in Lecture 04: RNA-Seq read counts generated from SARS-CoV-2 infections of AEC cells, used to compare the timecourse of (pro-inflammatory) expression changes in samples treated with and without HSP90 inhibitors. Published in iScience, doi: https://doi.org/10.1016/j.isci.2021.102151

0.4.4.3 Dataset 3: GSE150316_DeseqNormCounts_final.txt¶

From Desai et al., 2020 on medRxiv (doi: https://doi.org/10.1101/2020.07.30.20165241), this dataset has normalized expression counts from RNAseq data. It covers multiple samples and tissues from COVID-positive patients, with a focus on lung tissue. The expression data was left unfiltered, so SARS-CoV-2 expression values are included as well.

0.4.4.4 Dataset 4: 2020.07.30.20165241-supp_tables.xlsx¶

From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241 this dataset contains patient information like viral load from tissues that were used for RNAseq.


0.5.0 Packages used in this lesson¶

repr - a package useful for altering some of the attributes of objects related to the R kernel.

tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2

viridis helps to create color-blind palettes for our data visualizations

RColorBrewer has some helpful palettes that we'll need to colour our data.

ComplexHeatmap (installed via Bioconductor below) will be used to help generate heatmap/dendrogram visualizations.

FactoMineR and factoextra will be used for PCA generation and visualization

Rtsne and umap are packages implementing non-linear projection algorithms

In [ ]:
# Some packages can be installed via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ComplexHeatmap")

# We need to specifically install a package called pbkrtest for the facto packages
# This is due to using an older version of R 
library(remotes)
install_version("pbkrtest", "0.5.1")

install.packages("FactoMineR", dependencies = TRUE)
install.packages("factoextra", dependencies = TRUE)

install.packages("Rtsne")
install.packages("umap")
In [1]:
# Packages to help tidy our data
library(tidyverse)
library(readxl)
library(magrittr)

# Packages for the graphical analysis section
library(repr)
library(viridis)
# library(gplots) # heatmap2()
library(ComplexHeatmap) # Heatmap()
library(RColorBrewer)

# Useful for PCA and PCA visualization
library(FactoMineR)
library(factoextra)

# Data projection packages
library(Rtsne)
library(umap)
-- Attaching core tidyverse packages ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 2.0.0 --
v dplyr     1.1.1     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.1     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.1     
-- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: 'magrittr'


The following object is masked from 'package:purrr':

    set_names


The following object is masked from 'package:tidyr':

    extract


Loading required package: viridisLite

Loading required package: grid

========================================
ComplexHeatmap version 2.6.2
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite:
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
  genomic data. Bioinformatics 2016.

This message can be suppressed by:
  suppressPackageStartupMessages(library(ComplexHeatmap))
========================================


Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa


1.0.0 Data categorization, dimension reduction, and dimension transformation¶

Last week we looked at an analysis of RNAseq data through a number of methods starting with broad-level volcano plots and moving towards gene-level expression visualizations with dotplots. In between we stopped to take a look at heatmaps. In this instance we simply used heatmaps to convey expression levels of multiple genes across one or more samples.

Looking more broadly, we now wish to ask questions such as "how similar are our samples?", "can our samples be grouped or categorized in some way?" and "is there an underlying structure or architecture to our data?"

We can study these aspects using a number of techniques that range from clustering/categorization to projection of high-dimensional data onto a lower set of dimensions (usually 2 or 3).


1.1.0 What kind of data do we care about?¶

We cannot begin our journey until we talk about the nature of the data we are interested in examining. Usually, our data will consist of many samples (observations) with some number of features captured about each observation. For example, with RNAseq data we could consider the measurements we capture for each gene as a separate feature/variable/column.

Conversely, you may have hundreds or thousands of samples you'd like to categorize in some way (or show that you can categorize) with just a smaller set of features. For every nut, there is a tool of sorts for cracking it!

We'll start with a modified version of the PHU age-group dataset that we've seen before in Lectures 02 and 03. The demographics data has been modified to include a set of normalized values (individuals per 100K) calculated from StatsCan 2017 census data: modelling from these data, the 2022 size of each age group was estimated, and case, death, and hospitalization counts were normalized by these subgroup sizes to generate values per 100K individuals. We'll be working with this 2022 dataset mainly because it uses more fine-grained binning of our age groups than the more recent datasets from the Ontario government archives.
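The per-100K normalization described above is simple arithmetic. As a quick sketch, here it is reproduced for the first row of the dataset (Algoma, ages 0 to 4), using values from that row; the CSV already contains the pre-computed cases_per_100k column, so this is just a sanity check.

```r
# Per-100K normalization, sketched with the first row of the dataset
# (Algoma, age group "0 to 4")
total_cases      <- 181    # cumulative cases for this age group in Algoma
individuals_2022 <- 5481   # estimated 2022 size of this age group

# Scale the count to incidents per 100,000 individuals
cases_per_100k <- total_cases / individuals_2022 * 100000
round(cases_per_100k)      # 3302, matching the pre-computed column
```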

The updated dataset will be found in phu_demographics_census_norm.csv and we'll use it to guide us through two sections of today's lecture.

Let's begin by loading the data and taking a look at its structure.

In [2]:
# Import the normalized demographics data
covid_demographics_norm.df <- read_csv("data/phu_demographics_census_norm.csv")

# Look at its structure
str(covid_demographics_norm.df, give.attr = FALSE)
Rows: 238 Columns: 15
-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Delimiter: ","
chr  (4): from_date, to_date, public_health_unit, age_group
dbl (11): total_cases, total_deaths, total_hospitalizations_count, percent_c...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
spc_tbl_ [238 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ from_date                   : chr [1:238] "January 15, 2020" "January 15, 2020" "January 15, 2020" "January 15, 2020" ...
 $ to_date                     : chr [1:238] "March 2, 2022" "March 2, 2022" "March 2, 2022" "March 2, 2022" ...
 $ public_health_unit          : chr [1:238] "Algoma" "Algoma" "Algoma" "Algoma" ...
 $ age_group                   : chr [1:238] "0 to 4" "12 to 19" "20 to 39" "40 to 59" ...
 $ total_cases                 : num [1:238] 181 412 1868 1424 379 ...
 $ total_deaths                : num [1:238] 0 0 1 5 0 14 13 0 0 2 ...
 $ total_hospitalizations_count: num [1:238] 9 5 31 48 0 69 44 10 2 35 ...
 $ percent_cases               : num [1:238] 0.0343 0.0781 0.3543 0.2701 0.0719 ...
 $ percent_deaths              : num [1:238] 0 0 0.0303 0.1515 0 ...
 $ percent_hospitalizations    : num [1:238] 0.0437 0.0243 0.1505 0.233 0 ...
 $ population_2022             : num [1:238] 117840 117840 117840 117840 117840 ...
 $ individuals_2022            : num [1:238] 5481 9230 24676 33111 7751 ...
 $ cases_per_100k              : num [1:238] 3302 4464 7570 4301 4890 ...
 $ deaths_per_100k             : num [1:238] 0 0 4 15 0 46 177 0 0 5 ...
 $ hospitalizations_per_100k   : num [1:238] 164 54 126 145 0 228 598 116 13 95 ...

1.2.0 Looking for trends/groups in your data¶

Let's begin looking at our covid_demographics_norm.df. Using a series of data sets, we've created a consolidated dataset with:

  1. Cumulative cases, deaths, and hospitalizations due to COVID-19 within each age group per PHU.
  2. Representation of each age group within each data category as a percent of total incidents in each PHU.
  3. Using 2017 census data, the number of cases per 100,000 individuals normalized by estimated population size for each age group within each PHU.

The question we want to answer is: of the 34 public health units, which look most similar based on the normalized case data for each age group? In order to visualize this data, we'll want to convert our current long-form data that looks something like this:

public_health_unit  age_group  total_cases  ...  cases_per_100k  deaths_per_100k  hospitalizations_per_100k
Algoma              0 to 4     181          ...  3302            0                164
Algoma              12 to 19   412          ...  4464            0                54
...                 ...        ...          ...  ...             ...              ...

Into something like this:

public_health_unit  population_2022  category  0 to 4  5 to 11  12 to 19  20 to 39  40 to 59  60 to 79  80+
Algoma              117840           cases     3302    4464     7570      4301      4890      2331      4116
...                 ...              ...       ...     ...      ...       ...       ...       ...       ...

We need to do the following to the dataset:

  1. Select just the variables we are interested in.
  2. Pivot out the age group data into its own columns so we have 1 observation (row) per PHU.
  3. Correct the position of the age groups (5 to 11 needs to be moved).
In [3]:
# Create a wide-format version of our normalized data
covid_demographics_norm_wide.df <-

    # Start by passing along the long-form normalized data
    covid_demographics_norm.df %>% 

    # Select for just the PHU, age_group, population size and cases/hosp/deaths_per_100k
    select(c(3,4, 11, 13:15)) %>% 

    # Pivot the data to a longer format
    pivot_longer(cols = -c(1:3), names_to = "category", values_to = "per_100k") %>% 

    # Get rid of the suffix of "_per_100k"
    mutate(category = str_remove_all(string = category, pattern = "_per_100k")) %>% 

    # Pivot the age_group/per_100k data out wider
    pivot_wider(names_from = age_group,
                values_from = per_100k
               ) %>% 

    # Move the "5 to 11" category to after "0 to 4". You could use a longer "select" call to do this too!
    relocate(., `5 to 11`, .after=`0 to 4`)

# Take a look at our resulting dataframe
head(covid_demographics_norm_wide.df)
A tibble: 6 × 10
public_health_unit  population_2022  category          0 to 4  5 to 11  12 to 19  20 to 39  40 to 59  60 to 79  80+
<chr>               <dbl>            <chr>             <dbl>   <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
Algoma              117840           cases             3302    4890     4464      7570      4301      2331      4116
Algoma              117840           deaths            0       0        0         4         15        46        177
Algoma              117840           hospitalizations  164     0        54        126       145       228       598
Brant County        153558           cases             4481    5376     6499      9448      6429      3995      5684
Brant County        153558           deaths            0       0        0         5         16        93        538
Brant County        153558           hospitalizations  116     15       13        95        187       460       1393

1.2.1 Cast our data to a matrix for heatmaps¶

At this point, we would normally be prepared to visualize our data with ggplot, but our data is in wide format and we'll be using a package that prefers to work with matrix data. In that case, we need to strip the data down further because all values in a matrix must be of the same type and we want to work with numeric data! We'll use as.matrix() for our conversion.

  1. We'll filter our dataset to only include the case data from above, and then drop the category variable from it.
  2. We'll move the public_health_unit information over to the row names of our matrix so we can still track our data properly.
  3. We'll still keep our population_2022 column of data but we'll have to remember that it's there.
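Before the cell below, it's worth seeing why the non-numeric columns have to go first. A matrix holds exactly one data type, so passing a mixed data frame to as.matrix() silently coerces every value to character. A quick illustrative sketch (made-up two-row data frame, not the lecture's code):

```r
# A matrix can hold only one type: mixing in a character column forces
# everything to character, which breaks numeric plotting downstream
mixed.df <- data.frame(phu   = c("Algoma", "Brant County"),
                       cases = c(3302, 4481))

mixed.mx <- as.matrix(mixed.df)
typeof(mixed.mx)                       # "character" -- the numbers are now strings!

# Dropping the character column first keeps the matrix numeric
typeof(as.matrix(mixed.df["cases"]))   # "double"
```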
In [6]:
# Now we need to make our matrix and assign row names. 
# It's kind of awkward to need this requirement.

# Build our matrix from the case data
covid_demographics_norm.mx <- covid_demographics_norm_wide.df %>% 

# Filter for just case data
dplyr::filter(category == "cases") %>% 

# Drop the PHU and category data
dplyr::select(-c(public_health_unit, category)) %>%

# Convert to a matrix
as.matrix()

# Set the row names using the information from the data frame
rownames(covid_demographics_norm.mx) <- covid_demographics_norm_wide.df %>% 
    pull(public_health_unit) %>% unique()

# Take a peek at our data
head(covid_demographics_norm.mx)
A matrix: 6 × 8 of type dbl
                                           population_2022  0 to 4  5 to 11  12 to 19  20 to 39  40 to 59  60 to 79  80+
Algoma                                     117840           3302    4890     4464      7570      4301      2331      4116
Brant County                               153558           4481    5376     6499      9448      6429      3995      5684
Durham Region                              711426           4669    5697     6407      11018     7333      4985      7562
Grey Bruce                                 176150           2324    3040     4643      5995      3489      2022      3056
Haldimand-Norfolk                          120008           3002    4956     5388      9803      5799      3568      6261
Haliburton, Kawartha, Pine Ridge District  190728           2263    2865     3673      7368      3681      1923      3612

1.3.0 Plot and reorder our PHUs as a heatmap with Heatmap()¶

From the ComplexHeatmap package we can use the function Heatmap() to generate a heatmap that can do a little more than our ggplot2 version. A nice aspect of using Heatmap() is that we can ask it to reorder our samples along both the rows and columns. At the same time, we can present the relationships between column elements or row elements in the form of dendrograms.

1.3.1 How do we reorder our data via clustering?¶

In its current form, we have ordered our data along the columns in an ascending ordinal arrangement. While this makes sense to us now, it may not help to visually identify the strongest trends or groupings in our data. Clustering attempts to bring order to our data by grouping data according to specific algorithms.

Overall, single points are usually grouped together as neighbours, with the "nearest" neighbours determined by a metric of some kind. These clusters are then further grouped using the same metric until the entire set is presented in these ordered groups. These groupings/relationships can also be presented as dendrograms. In our current case, we aren't concerned with determining groups per se, but rather with connecting them by their similarity and then creating a hierarchical order.

Our data is usually coded in n-dimensional space, depending on the nature of our dataset. For covid_demographics_norm.mx, each of our 7 age-group columns is coded by data from 34 PHUs, meaning each column lives in 34 dimensions (34 features). Conversely, our 34 rows represent PHUs, each coded by data across 7 age groups and therefore 7 dimensions.
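As a peek under the hood, the clustering Heatmap() performs corresponds to base R's dist() + hclust() applied to those row/column vectors (assuming the defaults of Euclidean distance and complete linkage). A minimal sketch on a toy matrix:

```r
# Minimal sketch of hierarchical clustering, the machinery behind the
# dendrograms Heatmap() draws (defaults assumed: Euclidean distance,
# complete linkage). Toy matrix: 4 "samples" in 2 dimensions.
toy.mx <- matrix(c(0, 0,
                   1, 1,
                   9, 9,
                   10, 10),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("A", "B", "C", "D"), c("f1", "f2")))

# 1. Pairwise distances between rows in feature space
toy.dist <- dist(toy.mx, method = "euclidean")

# 2. Repeatedly merge the nearest clusters into a hierarchy
toy.hc <- hclust(toy.dist, method = "complete")

# 3. The leaf order is the reordering applied to the heatmap rows
toy.hc$order   # A and B end up adjacent, as do C and D
plot(toy.hc)   # the dendrogram itself
```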

Important or helpful parameters to run Heatmap():

  • matrix: our matrix object. It must be a matrix, and not a data frame. It could also be a vector but that becomes a single column.
  • col: determines the colours used for the image.
  • name: the name/title of the heatmap (also used as the legend title by default)
  • row_* : set our row text properties including titles and labels:
    • row_title, row_title_side, row_title_gp (graphic properties)
    • row_names_side, row_names_gp
    • column properties can also be changed using column_*
  • cluster_rows and cluster_columns: logical parameters to determine if clustering should occur (default is TRUE)
  • show_heatmap_legend: whether or not to show the heatmap legend
  • heatmap_legend_param: set legend properties, e.g. the title via list(title = "x")
  • show_[row/col]_dend: Whether or not to show dendrograms for the rows/columns

There are actually a lot of options but these should help us make a basic one. Let's plot our data with and without a dendrogram to compare.

In [7]:
?Heatmap
In [8]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=12, repr.plot.height=10)

# Create a Heatmap object
cases_hmap <- Heatmap(covid_demographics_norm.mx[, -1],               # Supply our matrix minus the population size
                      cluster_rows = FALSE, cluster_columns = FALSE, # Don't cluster on either rows or columns
                      
                      # Use column_title as the title of our heatmap
                      column_title = "Heatmap of COVID-19 cases in PHUs by age group: unclustered",
                      
                      # Rotate the legend horizontally and give it a title
                      heatmap_legend_param = list(title = "cases per 100K individuals",
                                                  legend_direction = "horizontal"),
                      
                      # Rotate column names to horizontal
                      column_names_rot = 0,
                      column_names_center = TRUE
                     )

# Plot the heatmap 
draw(cases_hmap, 
     # Plot the legend on the bottom
     heatmap_legend_side = "bottom"
    )
In [9]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=12, repr.plot.height=10)

# Create a Heatmap object
cases_hmap <- Heatmap(covid_demographics_norm.mx[,-1],               # Supply our matrix minus the population size
                      cluster_rows = TRUE, cluster_columns = TRUE,   # Cluster on both rows and columns
                      
                      # Use column_title as the title of our heatmap
                      column_title = "Heatmap of COVID-19 cases in PHUs by age group: clustered",
                      
                      # Rotate the legend horizontally and give it a title
                      heatmap_legend_param = list(title = "cases per 100K individuals",
                                                  legend_direction = "horizontal"),
                      
                      # Rotate column names to horizontal
                      column_names_rot = 0,
                      column_names_center = TRUE
                     )

# Plot the heatmap 
draw(cases_hmap, 
     # Plot the legend on the bottom
     heatmap_legend_side = "bottom"
    )

1.3.2 Concatenate multiple heatmaps together into a HeatmapList object¶

Our heatmap above is drawn from a single dataset category - cases - but it helps us to see that there are some strong signals that differentiate between our different PHUs. This is a 34x7 grid in which we can investigate all of the case data in our dataset. What happens when we want to produce this kind of visualization for our other two metrics, hospitalizations and deaths?

To accomplish this, we can repeat our steps and create 3 separate Heatmap objects but we can plot them together as a single complex heatmap. In fact, rather than store multiple heatmap objects, we'll create a HeatmapList object by concatenating together multiple Heatmap objects using a for loop.

We'll recreate our code from above but simplify the heatmap code a little.

In [10]:
# Create a list of heatmap objects from our demographic data
hm_list <- NULL

# Create some quick vectors to help ourselves out
categories <- c("cases", "hospitalizations", "deaths")

# Store the rownames we want to use on our matrices
mx_rownames <- covid_demographics_norm_wide.df %>% pull(public_health_unit) %>% unique()

# Use a loop to separate the data
for (i in categories) {
  
    # Create a temporary matrix of the specific category data
    temp_mx <- covid_demographics_norm_wide.df %>% 
        dplyr::filter(category == i) %>% 
        dplyr::select(-c(public_health_unit, category)) %>% 
        as.matrix()
    
    # Set the rownames
    rownames(temp_mx) <- mx_rownames
    
    # Create a Heatmap object
    hmap <- Heatmap(temp_mx[,-1],   # Supply our matrix minus the population size

                      # Use column_title as the title of our heatmap
                      column_title = i,
                      
                      # Rotate the legend horizontally and give it a title based on category
                      heatmap_legend_param = list(title = paste0(i, " per 100K individuals"),
                                                  legend_direction = "horizontal")
                     )
    
    # Add this to our heatmap list
    hm_list <- hm_list + hmap
}

# What kind of object is hm_list?
class(hm_list)
'HeatmapList'

1.3.3 draw() your HeatmapList¶

Now that we have our HeatmapList we can simply draw the whole thing and it will automatically concatenate for us. Of course there are many options we can use to display the list. One thing to note is that the clustering of columns in each heatmap is independent while the clustering of rows (and thus order) is based on only the first heatmap (by default).

Using the draw() method to display your HeatmapList will allow you to further customize details of the overall figure including setting row_title and column_title for the entire figure.

In [11]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=18, repr.plot.height=10)

draw(hm_list, 
     column_title = "COVID-19 metrics breakdown by age group and PHU\n",
     column_title_gp = gpar(fontsize = 20)
    )
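As a side note, if you would rather order the PHUs by a different metric, draw() accepts a main_heatmap argument selecting which list member drives the row clustering. A sketch; index 3 assumes the loop order used above (cases, hospitalizations, deaths):

```r
# Re-draw the same list, but let the third heatmap ("deaths", given the
# loop order above) control the row clustering and ordering
draw(hm_list,
     main_heatmap = 3,
     column_title = "COVID-19 metrics breakdown, PHUs ordered by deaths\n",
     column_title_gp = gpar(fontsize = 20)
    )
```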

1.4.0 Building correlation heatmaps with complex RNA-Seq data¶

Our heatmap above is drawn from a relatively simple dataset, but it helps us to see that there are some strong signals that differentiate between our different PHUs. It becomes clear that while the bulk of cases in each PHU is dominated by the 20-39 age segment, the bulk of hospitalizations and deaths originates from the 80+ segment.

Of course, this data is cumulative from January 2020 to March 2022. With vaccinations and different variants since then, a look at the current demographics may reveal a shift in these trends.

Considering that this is a 34x7 grid, it is relatively easy to visualize and dissect all of the data in our dataset. What happens, however, when we want to produce this kind of visualization from something much larger or more complex?

Recall that last lecture we investigated read count data from Wyler et al., 2020 - Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv. We used this dataset to produce a scatterplot matrix comparing some of the replicate data within the set. Across this set there are 36 RNA-Seq datasets spanning 27,011 genes.

Let's begin by opening up the data file and getting a quick reminder of what it looks like.

In [12]:
# Read in your read_count data
wyler_readcounts.df <- read_tsv("data/Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv")

str(wyler_readcounts.df, give.attr = FALSE)
Rows: 27011 Columns: 38
-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Delimiter: "\t"
chr  (1): gene
dbl (37): width, AECII_01_24h_DMSO-1, AECII_02_24h_DMSO-2, AECII_03_24h_DMSO...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
spc_tbl_ [27,011 x 38] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ gene                        : chr [1:27011] "A1BG" "A1BG-AS1" "A1CF" "A2M" ...
 $ width                       : num [1:27011] 1766 2130 9529 4653 2187 ...
 $ AECII_01_24h_DMSO-1         : num [1:27011] 22 55 0 0 0 ...
 $ AECII_02_24h_DMSO-2         : num [1:27011] 19 20 0 1 0 ...
 $ AECII_03_24h_DMSO-3         : num [1:27011] 15 38 0 0 2 ...
 $ AECII_04_24h_200nM17AAG-1   : num [1:27011] 29 11 0 0 2 4740 0 0 882 0 ...
 $ AECII_05_24h_200nM17AAG-2   : num [1:27011] 27 15 0 0 0 ...
 $ AECII_06_24h_200nM17AAG-3   : num [1:27011] 30 17 0 0 2 4220 0 0 820 0 ...
 $ AECII_07_24h_S2_DMSO-1      : num [1:27011] 24 41 0 0 0 ...
 $ AECII_08_24h_S2_DMSO-2      : num [1:27011] 18 46 0 0 2 ...
 $ AECII_09_24h_S2_DMSO-3      : num [1:27011] 22 42 0 0 4 2750 0 0 739 0 ...
 $ AECII_10_24h_S2_200nM17AAG-1: num [1:27011] 23 13 0 1 3 ...
 $ AECII_11_24h_S2_200nM17AAG-2: num [1:27011] 25 24 0 0 1 ...
 $ AECII_12_24h_S2_200nM17AAG-3: num [1:27011] 18 28 1 0 1 ...
 $ AECII_13_48h_DMSO-1         : num [1:27011] 24 57 0 0 0 ...
 $ AECII_14_48h_DMSO-2         : num [1:27011] 38 56 0 0 0 1930 0 0 892 0 ...
 $ AECII_15_48h_DMSO-3         : num [1:27011] 33 64 0 0 0 ...
 $ AECII_16_48h_200nM17AAG-1   : num [1:27011] 29 35 0 0 1 ...
 $ AECII_17_48h_200nM17AAG-2   : num [1:27011] 20 24 0 0 0 ...
 $ AECII_18_48h_200nM17AAG-3   : num [1:27011] 23 28 0 0 0 ...
 $ AECII_19_48h_S2_DMSO-1      : num [1:27011] 13 45 0 0 0 ...
 $ AECII_20_48h_S2_DMSO-2      : num [1:27011] 22 36 0 0 0 ...
 $ AECII_21_48h_S2_DMSO-3      : num [1:27011] 21 30 0 0 1 ...
 $ AECII_22_48h_S2_200nM17AAG-1: num [1:27011] 31 29 0 0 1 ...
 $ AECII_23_48h_S2_200nM17AAG-2: num [1:27011] 36 23 0 0 1 ...
 $ AECII_24_48h_S2_200nM17AAG-3: num [1:27011] 20 19 1 0 0 ...
 $ AECII_25_72h_DMSO-1         : num [1:27011] 35 43 0 1 2 ...
 $ AECII_26_72h_DMSO-2         : num [1:27011] 26 38 0 0 1 997 0 0 753 0 ...
 $ AECII_27_72h_DMSO-3         : num [1:27011] 29 53 0 0 4 1230 0 0 953 0 ...
 $ AECII_28_72h_200nM17AAG-1   : num [1:27011] 32 57 0 0 2 ...
 $ AECII_29_72h_200nM17AAG-2   : num [1:27011] 41 49 0 0 2 1050 0 0 553 0 ...
 $ AECII_30_72h_200nM17AAG-3   : num [1:27011] 33 37 0 0 3 857 0 0 458 0 ...
 $ AECII_31_72h_S2_DMSO-1      : num [1:27011] 24 58 0 0 1 ...
 $ AECII_32_72h_S2_DMSO-2      : num [1:27011] 33 63 0 0 3 ...
 $ AECII_33_72h_S2_DMSO-3      : num [1:27011] 23 47 1 1 2 ...
 $ AECII_34_72h_S2_200nM17AAG-1: num [1:27011] 22 39 0 1 1 ...
 $ AECII_35_72h_S2_200nM17AAG-2: num [1:27011] 28 26 0 0 2 866 0 0 481 0 ...
 $ AECII_36_72h_S2_200nM17AAG-3: num [1:27011] 43 34 0 0 3 ...

1.4.1 Wrangle our readcount data¶

As you might recall from last week, there is quite a bit of data in this set. So that we can more easily visualize our data, we'll want to wrangle it a bit more by:

  1. Updating the column names to remove some of the unnecessary title information.
  2. Filtering by readcounts to limit our genes to those with values between 1000 and 3000.
  3. Converting our dataframe to a matrix and renaming the rows using the genes.
In [13]:
wyler_readcounts_filtered.df <-
    
    wyler_readcounts.df %>% 

    # Rename the columns by removing the first portion: AECII_xx
    rename_with(., ~ str_replace(string = .x, 
                                 pattern = r"(\w*_\d*_)", 
                                 replacement = "")) %>% 

    # Keep only genes whose readcounts fall between 1000 and 3000 in every sample
    filter(if_all(.cols = -1, .fns = ~ .x > 1000 & .x < 3000))

# Create our data matrix
wyler_readcounts.mx <- as.matrix(wyler_readcounts_filtered.df[, 3:38])

# Set the row names using the information from the data frame
rownames(wyler_readcounts.mx) <- wyler_readcounts_filtered.df$gene

# Check the characteristics of our matrix
head(wyler_readcounts.mx)
str(wyler_readcounts.mx)
A matrix: 6 × 36 of type dbl (first and last 10 columns shown)
         24h_DMSO-1  24h_DMSO-2  24h_DMSO-3  24h_200nM17AAG-1  24h_200nM17AAG-2  24h_200nM17AAG-3  24h_S2_DMSO-1  24h_S2_DMSO-2  24h_S2_DMSO-3  24h_S2_200nM17AAG-1  ...  72h_DMSO-3  72h_200nM17AAG-1  72h_200nM17AAG-2  72h_200nM17AAG-3  72h_S2_DMSO-1  72h_S2_DMSO-2  72h_S2_DMSO-3  72h_S2_200nM17AAG-1  72h_S2_200nM17AAG-2  72h_S2_200nM17AAG-3
AP1S1    1957  1310  1623  2520  2448  2618  1644  1585  1741  2536  ...  1953  1835  1738  1490  1571  1657  1394  1531  1331  1633
AP2S1    1575  1162  1408  1669  1441  1643  1187  1200  1379  1551  ...  1561  1719  1489  1318  1434  1389  1219  1522  1266  1536
APH1A    2017  1533  1890  2281  2089  2266  1794  1697  1943  2043  ...  2253  2532  2373  2010  2329  2316  2109  2306  1804  2375
ARL8B    1835  1506  1697  2823  2708  2937  1616  1566  1812  2994  ...  2043  2927  2713  2275  2181  2194  1903  2658  2072  2629
ARPC4    2503  1651  2005  1732  1603  1663  2051  1899  2285  1602  ...  2225  1931  1812  1658  2027  2095  1664  1990  1549  1899
ATP6AP1  1783  1373  1714  2618  2426  2638  1503  1467  1673  2291  ...  2524  2231  2006  1709  2619  2713  2368  1685  1560  1855
 num [1:109, 1:36] 1957 1575 2017 1835 2503 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:109] "AP1S1" "AP2S1" "APH1A" "ARL8B" ...
  ..$ : chr [1:36] "24h_DMSO-1" "24h_DMSO-2" "24h_DMSO-3" "24h_200nM17AAG-1" ...

1.4.2 Plot a heatmap of your readcount data¶

What if we were to produce a heatmap directly from the raw data? It would be impossible to read, and including a dendrogram would take forever given the 27,011 rows to process across the 36 datasets. That's why we'll use our small subset of data as an example to build a heatmap.

Let's plot the data now with heatmap.2 and see how it looks with a dendrogram.

In [14]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)

# Create a Heatmap object
wyler_hmap <- Heatmap(wyler_readcounts.mx,               # Supply our matrix 
                      cluster_rows = TRUE, cluster_columns = TRUE,   # Cluster on both rows and columns
                      col = viridis(100),
                      
                      # Use column_title as the title of our heatmap
                      column_title = "Heatmap of Wyler et al., 2020 readcount data",
                      
                      # Rotate the legend horizontally and give it a title
                      heatmap_legend_param = list(title = "readcounts per gene",
                                                  legend_direction = "horizontal")
                     )

# Plot the heatmap 
draw(wyler_hmap, 
     # Plot the legend on the bottom
     heatmap_legend_side = "bottom"
    )

1.4.3 Generate a correlation matrix of your data¶

From the above output, we can see that our heatmap is already quite hard to read - and that's with just 109 genes! We do, however, get some strong trends regarding our datasets: certain sets of replicates group together, but not all of them. This might be clearer if we were to factor in all of the datapoints, or just the relevant genes that define the differences between datasets. We'll discuss the second part of that thought later in this lecture.

Recall that what we're really interested in, regarding these readcount data, is just how similar the experiments are to one another. Whereas we were a little limited by the space needed to produce the scatterplot matrix, we were able to produce correlation values between experiments on a small scale. We can extend that visualization to compare all datasets, and then visually summarize the result as a heatmap.

The portion of the scatterplot matrix we'll extend is the correlation values. Whereas ggpairs() defaults to a Pearson correlation, cor() lets us choose Pearson, Spearman, or Kendall. Sometimes you may wish to compare both Pearson and Spearman, since Pearson measures linear relationships whereas Spearman uses ranks to measure monotonic relationships.
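To make the Pearson-versus-Spearman distinction concrete, here's a minimal sketch on made-up data (x and y here are hypothetical, not from our dataset):

```r
# Hypothetical toy data: y grows monotonically with x, but not linearly
x <- 1:10
y <- x^3

cor(x, y, method = "pearson")   # < 1, since the relationship isn't linear
cor(x, y, method = "spearman")  # exactly 1, since the relationship is monotonic
```

Because Spearman only looks at ranks, any monotonic transformation of the data leaves it unchanged, while Pearson is penalized by the curvature.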

To generate our matrix we'll employ a couple of additional functions:

  • expand.grid(): we can use this to build a matrix of pair-wise combination values. We'll use these to generate the pair-wise column combinations we want to compare with cor().
  • apply(): by now you should be familiar with this function. We'll use it to iterate through our pair-wise column combinations and send each of those to...
  • cor(): this will return the correlation coefficient based on the method we choose (pearson, kendall, spearman)
In [15]:
# Example of how you can use expand.grid to make pairwise combination between two sets of data.
expand.grid(c(1:3),c(4:6))
A data.frame: 9 × 2
   Var1  Var2
  <int> <int>
1     1     4
2     2     4
3     3     4
4     1     5
5     2     5
6     3     5
7     1     6
8     2     6
9     3     6
In [16]:
# First we need to fix the column names from our data
wyler_readcounts_only.df <-     
    
    wyler_readcounts.df %>% 
    # Rename the columns by removing the first portion: AECII_xx
    rename_with(., ~ str_replace(string = .x, 
                                 pattern = r"(\w*_\d*_)", 
                                 replacement = "")) %>% 
    # Drop the first two columns as well
    select(c(3:38))

# Create our correlation matrix
wyler_cor.mx <- matrix(apply(expand.grid(c(1:36),c(1:36)),                     # generate the pair-wise combinations
                             MARGIN = 1,                                       # explore it row-by-row
                             function(x) cor(wyler_readcounts_only.df[, x[1]], 
                                             wyler_readcounts_only.df[, x[2]],
                                             method = "pearson")               # Pearson correlation
                             ),
                       nrow = 36,                                              # Cast our result as a matrix
                       dimnames = list(colnames(wyler_readcounts_only.df),     # name the rows and columns
                                       colnames(wyler_readcounts_only.df))
                       )

# take a look at the final matrix
head(wyler_cor.mx)
A matrix: 6 × 36 of type dbl (columns 24h_DMSO-1 … 72h_S2_200nM17AAG-3; first 10 and last 10 columns shown)
24h_DMSO-1       : 1.0000000 0.9916027 0.9971539 0.8689104 0.8698985 0.8664966 0.9978794 0.9981499 0.9978888 0.8777276 ... 0.9187739 0.7915054 0.7936938 0.7961030 0.7952450 0.8135576 0.7894876 0.7982301 0.7965116 0.8005285
24h_DMSO-2       : 0.9916027 1.0000000 0.9977965 0.8985301 0.8982797 0.8951874 0.9927109 0.9941380 0.9903860 0.9033440 ... 0.9561826 0.8172553 0.8191693 0.8210994 0.8311898 0.8442450 0.8243365 0.8219231 0.8192079 0.8235471
24h_DMSO-3       : 0.9971539 0.9977965 1.0000000 0.8856510 0.8857412 0.8825415 0.9973488 0.9980146 0.9957036 0.8917670 ... 0.9404527 0.8046052 0.8065164 0.8088835 0.8157015 0.8315687 0.8095542 0.8098321 0.8078603 0.8121372
24h_200nM17AAG-1 : 0.8689104 0.8985301 0.8856510 1.0000000 0.9991436 0.9993164 0.8698156 0.8708104 0.8599358 0.9965957 ... 0.9259875 0.9071223 0.9061716 0.9071157 0.7929273 0.7942655 0.7833115 0.9038799 0.8992139 0.9050855
24h_200nM17AAG-2 : 0.8698985 0.8982797 0.8857412 0.9991436 1.0000000 0.9992593 0.8699711 0.8713050 0.8609952 0.9971129 ... 0.9236686 0.9030784 0.9022990 0.9031301 0.7878839 0.7896596 0.7781874 0.9010112 0.8955725 0.9018463
24h_200nM17AAG-3 : 0.8664966 0.8951874 0.8825415 0.9993164 0.9992593 1.0000000 0.8667858 0.8674853 0.8567663 0.9965608 ... 0.9231860 0.9081182 0.9071450 0.9082667 0.7861012 0.7870452 0.7760100 0.9054197 0.9006033 0.9066919

1.4.4 Generate a heatmap of your correlation matrix¶

Now that we've completed the correlation matrix and it appears to be correct, we can generate the heatmap of the data, allowing it to group data based on the Pearson correlation values.

In [17]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)

# Create a Heatmap object
wyler_hmap <- Heatmap(wyler_cor.mx,               # Supply our matrix 
                      cluster_rows = TRUE, cluster_columns = TRUE,   # Cluster on both rows and columns
                      col = viridis(100),
                      
                      # Use column_title as the title of our heatmap
                      column_title = "Heatmap of RNA-Seq Pearson correlation on readcounts",
                      
                      # Rotate the legend horizontally and give it a title
                      heatmap_legend_param = list(title = "Pearson score",
                                                  legend_direction = "horizontal"),
                      
                      # Set the row/column label font size
                      row_names_gp = gpar(fontsize = 16),
                      column_names_gp = gpar(fontsize = 16)
                     )

# Plot the heatmap 
draw(wyler_hmap, 
     # Plot the legend on the bottom
     heatmap_legend_side = "bottom"
    )

1.5.0 Make a cluster dendrogram with hclust()¶

As you can see, clustering our data (on the right metric!) makes a big difference in how our data is displayed. In our last few heatmaps, we allowed Heatmap() to generate its own clustering. We could have supplied it with a specific function to calculate distances, or even a dendrogram object to order the data. Under the hood, the function defaults to euclidean distances and passes that information on to the hclust() function to produce the dendrograms.

The choice of method determines how the data is clustered. The choices include: ward.D, ward.D2, single, complete (default), average, mcquitty, median and centroid.

In general ward.D, ward.D2 and complete strategies try to find compact small "spherical" clusters. The single linkage adopts a 'friends of friends' clustering strategy. The other methods can be said to aim for somewhere in between. For more information on these, you can dig into the hclust() documentation

We can turn to hclust() to generate just dendrograms for us but prior to that we still need to reformat our matrix a little using the dist() function.


1.5.1 Calculate the distance between our points with dist()¶

Remember we talked about our data being in n-dimensional space? Using those coordinates, there are a number of ways to calculate the distance between points. The simplest form is to calculate the euclidean distance between points using the generic formula:

$$d(p,q) = \sqrt{(p_{1}-q_{1})^2 + (p_{2}-q_{2})^2 + \ldots + (p_{n}-q_{n})^2}$$

but there are a number of other options as well. For instance, you can also choose the maximum distance between two components (same dimension). The dist() function can generate these values for you, using the parameter method to determine how the distance will be computed. Your options are: euclidean, maximum, manhattan, canberra, binary or minkowski.

The dist() function will return a dist object which is a lower triangle distance matrix between all of the rows/observations using the columns as coordinates.
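As a quick sanity check on what dist() returns, here's a minimal self-contained sketch on three made-up 2-D points (the point names are arbitrary):

```r
# Three hypothetical points in 2-D space; rows are observations, columns are coordinates
pts <- matrix(c(0, 0,
                3, 4,
                0, 8),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), NULL))

# Returns a lower-triangle 'dist' object of pairwise euclidean distances,
# e.g. d(a, b) = sqrt(3^2 + 4^2) = 5
dist(pts, method = "euclidean")
```

Swapping `method = "manhattan"` (or any of the other options) changes only how each pairwise distance is computed, not the shape of the result.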

Let's return to our PHU data to try it out and see how it works.

In [18]:
# The default method is euclidean
dist(covid_demographics_norm.mx[,-1], 
     method = "euclidean") %>% str()
 'dist' num [1:561] 4365 6804 2961 3783 2579 ...
 - attr(*, "Size")= int 34
 - attr(*, "Labels")= chr [1:34] "Algoma" "Brant County" "Durham Region" "Grey Bruce" ...
 - attr(*, "Diag")= logi FALSE
 - attr(*, "Upper")= logi FALSE
 - attr(*, "method")= chr "euclidean"
 - attr(*, "call")= language dist(x = covid_demographics_norm.mx[, -1], method = "euclidean")

1.5.2 Cluster your distance matrix¶

Now we are ready to cluster our distance matrix using the hclust() method

Parameters that are important

  • method: the agglomeration method used to cluster our data.
    • ward.D, ward.D2, single, complete (default), average, mcquitty, median and centroid.
  • k: the number of groups we would like our final data to be split into - represented by colours. Note that k belongs to the plotting/cutting step (e.g. fviz_dend() or cutree()), not to hclust() itself, which always builds the full tree.
In [19]:
# Make a cluster object and take a look at it
phu.hc <- hclust(dist(covid_demographics_norm.mx[,-1]), method = "ward.D2")

# What does our hclust object contain?
str(phu.hc)
List of 7
 $ merge      : int [1:33, 1:2] -13 -9 -18 -25 -3 -2 -4 -7 -5 -29 ...
 $ height     : num [1:33] 870 1178 1304 1364 1433 ...
 $ order      : int [1:34] 1 4 9 21 6 13 20 28 16 23 ...
 $ labels     : chr [1:34] "Algoma" "Brant County" "Durham Region" "Grey Bruce" ...
 $ method     : chr "ward.D2"
 $ call       : language hclust(d = dist(covid_demographics_norm.mx[, -1]), method = "ward.D2")
 $ dist.method: chr "euclidean"
 - attr(*, "class")= chr "hclust"
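If we just want the group assignments at a given k, rather than a full dendrogram plot, base R's cutree() can cut the tree directly. A minimal sketch on the built-in USArrests dataset (used here only because the PHU matrix isn't reproduced in this cell):

```r
# Cluster a built-in dataset the same way we clustered the PHU data
hc <- hclust(dist(USArrests), method = "ward.D2")

# Cut the tree into 3 groups; returns a named vector of cluster assignments
groups <- cutree(hc, k = 3)

table(groups)   # how many observations fell into each group
```

This is the same cut that fviz_dend() performs internally when you hand it a k.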

1.5.3 Plot your cluster object using fviz_dend() from the package factoextra¶

Now that we have generated our clustering object, you can see that it holds a number of features, including the height (length) of our branches, an ordering of our PHUs, and the order in which they merge. The merge process is described in more detail in the hclust() documentation.

To simplify the plotting process, we can use the fviz_dend() function which will parse through the hclust object to produce a proper dendrogram. We can even specify some grouping or colouring schemes based on how many groups, k, we believe are present in our data.

We can treat the resulting plot like a ggplot to update particular elements as well.

In [20]:
# Plot out our dendrogram with fviz_dend

fviz_dend(phu.hc,                        # provide our hclust object
          cex = 1.5,                     # set the text size to be larger
          k = 3,                         # How many clusters do we expect in our data 
          palette = "Set1",              # what colours will we use. Many options including RColorBrewer palettes
          horiz = TRUE,
          labels_track_height = 12000,
         ) +

    # Bump up the text size for my old eyes
    theme(text = element_text(size = 20))
Warning message:
"The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
i The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>."

2.0.0 Clustering data into groups¶

So we've visualized our dataset as a dendrogram and it gives us an idea of how the samples might be related (in n-dimension space) based on the case values we produced from normalization.

K-means clustering is unsupervised learning for uncategorized data. We don't really know the training labels of our data and are more interested in seeing which ones group together. How many clusters should we aim to generate? We've already seen that our data likely splits into 3 groups based on the heatmaps and hclust() data. Will we get the same thing with a k-means method?

When in doubt, you can quickly consult the factoextra function fviz_nbclust(), which can estimate how many clusters are ideal for representing your data. Note that it may not produce your ideal number of clusters, but if you're stuck, this is a good start.

There are three method options: wss, gap_stat, and silhouette, each offering a different criterion for choosing the number of clusters.

  • wss elbow method aims to minimize intracluster variation. How compact is our clustering? A lower WSS is better.

  • silhouette measures how well an object lies within its cluster by comparing its average distance to members of its own cluster, $a(i)$, against its average distance to members of the nearest other cluster, $b(i)$. We evaluate $\frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$ with higher values indicating better clustering.

  • gap_stat also compares total intra-cluster variation, but against a null reference distribution. The optimal choice is the smallest k for which the gap statistic is within one standard deviation of the gap at k+1.
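Under the hood, the wss method is simply the tot.withinss value from kmeans() evaluated over a range of k. A minimal sketch of that idea on the built-in USArrests data (our PHU matrix would slot in the same way):

```r
set.seed(123)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(scale(USArrests), centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k beyond which adding another cluster stops paying off
wss
```

fviz_nbclust() wraps exactly this loop and plots the curve for you.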

In [21]:
# How many clusters should we generate?
fviz_nbclust(covid_demographics_norm.mx[,-1], kmeans, method="wss") + 
    theme(text = element_text(size=20))
In [22]:
# How many clusters should we generate?
fviz_nbclust(covid_demographics_norm.mx[,-1], kmeans, method="silhouette") + 
    theme(text = element_text(size=20))
In [23]:
# How many clusters should we generate?
fviz_nbclust(covid_demographics_norm.mx[,-1], kmeans, method="gap_stat") + 
    theme(text = element_text(size=20))

2.1.0 Generate your clusters using kmeans()¶

So from three different analyses we didn't really get a consensus on the best number of clusters, but the answer appears to range between 1 and 4. Note that if your data consistently suggest k = 1, then you shouldn't cluster at all! Since our earlier analyses suggested 3 clusters, let's start with that. We'll use the kmeans() function to accomplish this.

Similar in idea to hclust(), the kmeans() algorithm attempts to partition the data into k groups, with an emphasis on minimizing the sum of squares from points to their assigned cluster centres. This differs from hclust(), which builds the entire relationship as a dendrogram without actually choosing "clusters".

We're also going to use set.seed() for our random number generation. We tend to think of randomness as inherent, but computationally, "randomness" is produced by deterministic algorithms. We can therefore "recreate" our randomness by using a pre-determined starting point, also known as the "seed".
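A quick demonstration of the point: re-setting the same seed reproduces the same "random" draws.

```r
set.seed(123)
a <- runif(3)     # three "random" uniform values

set.seed(123)
b <- runif(3)     # re-set the seed, draw again

identical(a, b)   # TRUE: same seed, same sequence
```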

The parameters we're interested in using with kmeans() are:

  • x the numeric matrix of our data
  • centers determines the number of clusters we want to produce
  • nstart defines how many randomly chosen sets of k centres you'll use to start the analysis. Choosing multiple different sets allows the algorithm to avoid local minima.
  • iter.max is the max number of iterations allowed while trying to converge on the best minimum metric
In [24]:
# Compute k-means with k = 3

# Set a seed for reproducibility.
set.seed(123)

# Generate our k-means analysis
phu_cases.km <- 
kmeans(scale(covid_demographics_norm.mx[,-1]), # We'll scale our data for this (more on that later too!)
       centers = 3, 
       nstart = 25,
       iter.max = 500)
In [25]:
# What is the structure of our kmeans object?
str(phu_cases.km)
List of 9
 $ cluster     : Named int [1:34] 2 1 1 2 1 2 1 3 2 1 ...
  ..- attr(*, "names")= chr [1:34] "Algoma" "Brant County" "Durham Region" "Grey Bruce" ...
 $ centers     : num [1:3, 1:7] 0.356 -0.955 0.82 0.349 -1.098 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : chr [1:7] "0 to 4" "5 to 11" "12 to 19" "20 to 39" ...
 $ totss       : num 231
 $ withinss    : num [1:3] 46.49 13.18 9.07
 $ tot.withinss: num 68.7
 $ betweenss   : num 162
 $ size        : int [1:3] 18 11 5
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
In [26]:
# K-means clusters showing the group of each individuals
phu_cases.km$cluster
Algoma
2
Brant County
1
Durham Region
1
Grey Bruce
2
Haldimand-Norfolk
1
Haliburton, Kawartha, Pine Ridge District
2
Halton Region
1
City of Hamilton
3
Hastings Prince Edward
2
Chatham-Kent
1
Kingston, Frontenac and Lennox & Addington
1
Lambton
1
Leeds, Grenville & Lanark District
2
Middlesex-London
1
Niagara Region
1
North Bay Parry Sound District
2
Northwestern
1
Ottawa
1
Peel
3
Huron Perth
2
Peterborough
2
Porcupine
1
Renfrew County and District
2
Eastern Ontario
1
Simcoe Muskoka District
1
Sudbury & Districts
1
Thunder Bay District
1
Timiskaming
2
Region of Waterloo
1
Wellington-Dufferin-Guelph
1
Windsor-Essex County
3
York Region
3
Toronto
3
Southwestern
2

2.2.0 Plot your k-means results with fviz_cluster()¶

Notice the structure of this output? It includes cluster, which denotes the cluster (of the 3 chosen) that each row belongs to. The centre of each cluster is defined by centers, which in this case is a 3x7 array where each row uses 7 values to define its position within our 7-dimension dataset. We can see the cluster sizes are 18, 11, and 5 PHUs respectively. Notice, however, that none of the original coordinates for our data are retained in this object.

Rather than use ggplot directly, we'll use the ggplot-compatible fviz_cluster() function to help us transform the values of our points from a 7-dimension coordinate system to a 2-dimension visual format suitable for our simpler brains. Under the hood, fviz_cluster() performs a principal component analysis to plot our data along the first two principal components (more on what that means later too!).

The parameters we should be concerned with in this function include:

  • object the partitioning object for our data. In the case of kmeans, it is our kmeans object.

  • data is the original dataset we used to generate the clustering. This will be necessary to plot the individual data points.

  • ellipse.type determines how we'll outline our clusters. This comes with a number of options including:

    • convex: draws boundaries based on the outer points in your cluster
    • confidence: produces a confidence ellipse around the cluster centres. This can be further adjusted using the parameter ellipse.level whose default is 0.95.
    • t, norm: assume multivariate t-distribution and multivariate normal distributions to produce their ellipses using ellipse.level to set their radius.
    • euclid: sets a circle of radius ellipse.level around the centre of each cluster.

Let's see how our data has been partitioned by the k-means clustering.

In [27]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=15)

# Plot the cluster
# You can treat this object like a ggplot object afterwards

phu_kmeans.plot <-

    fviz_cluster(object = phu_cases.km, # our k-means object
                 data = covid_demographics_norm.mx[, -1], # Our original data needed for PCA to visualize
                 ellipse.type = "convex", 
                 ggtheme = theme_bw(), 
                 repel=TRUE, # Try to avoid overlapping text
                 labelsize = 20,
                 pointsize = 4,
                 main = "K-means clustering of PHU by normalized case number"
                 ) +

    # Set some ggplot theme information
    theme(text = element_text(size=20)) +

    # Set the colour and fill scheme to viridis
    scale_fill_viridis_d() +
    scale_colour_viridis_d()

# Print the plot!
phu_kmeans.plot

2.3.0 Clustering your RNA-Seq data¶

Let's return to the Wyler et al., 2020 readcount data and take a look at it through a different lens. Whereas before we had generated a heatmap of the data based on looking at genes expressed within a readcount range, we'll now take a different approach.

Let's look at the top 500 variable genes in the dataset to help cluster them. To accomplish that we'll want to generate the standard deviation in read count for each row (gene). We'll utilize a few verbs we haven't used before in this class:

  • rowwise(): in the same class of functions as group_by(), this will subtly alter the tibble so that when we perform math operations on it, these will be completed on a row-by-row basis.
  • c_across(): this helper version of combine (c()) combines values across columns within our tibble.
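A tiny sketch of how rowwise() and c_across() combine, using a made-up two-gene tibble (the column names s1–s3 are hypothetical stand-ins for our sample columns):

```r
library(dplyr)

toy <- tibble(gene = c("g1", "g2"),
              s1 = c(1, 10), s2 = c(2, 20), s3 = c(3, 30))

toy %>%
  rowwise() %>%                               # operate one row at a time
  mutate(stdev = sd(c_across(s1:s3))) %>%     # sd across the sample columns
  ungroup()
# g1 gets sd(1, 2, 3) = 1; g2 gets sd(10, 20, 30) = 10
```

Without rowwise(), sd() would be computed over whole columns at once rather than per gene.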
In [ ]:
# Review the wyler readcount data
head(wyler_readcounts.df)
In [28]:
# Save our results into a new dataframe
wyler_readcounts_filtered.df <-
    wyler_readcounts.df %>% 

    # Rename the columns by removing the first portion: AECII_xx
    rename_with(., ~ str_replace(string = .x, 
                                 pattern = r"(\w*_\d*_)", 
                                 replace = "")) %>% 

    # filter out the low readcount data. Low values will create wild variance easily
    filter(if_all(.cols = -1, .fns = ~ .x > 10)) %>% 

    # Prepare to do row-wise calculations
    rowwise() %>% 

    # Calculate the standard deviation across rows
    mutate(stdev = sd(c_across(2:38))) %>% 
    
    # Ungroup and sort the data by descending value
    ungroup() %>% 
    arrange(desc(stdev)) %>% 
    
    # Take the top 500 most variable genes
    slice(1:500)

Now we'll just convert our dataframe into a matrix of values so we can perform our various analyses.

In [29]:
# Save just the RNA-Seq data into a matrix
wyler_readcounts.mx <- as.matrix(wyler_readcounts_filtered.df[, 3:38])

# Set the row names using the information from the data frame
rownames(wyler_readcounts.mx) <- wyler_readcounts_filtered.df$gene

# Take a quick look at the resulting matrix and its properties
head(wyler_readcounts.mx)
str(wyler_readcounts.mx)
A matrix: 6 × 36 of type dbl (columns 24h_DMSO-1 … 72h_S2_200nM17AAG-3; first 10 and last 10 columns shown)
KRT6A  : 129671  71949 100638  37054  33900  34311 109804 102597 124371  41077 ... 46340  4189  4187  3485 44290 51827 38740  4764  3735  4345
EEF1A1 :  73954  55211  65183 115435 108071 116563  59765  56679  65284 109138 ... 93440 73011 69767 57666 45198 44648 38619 69210 53904 68927
PSAP   :  20807  15808  18643  47679  42159  45534  17508  16545  18473  45529 ... 39455 80703 75471 61679 30463 28862 26229 72179 57740 72614
LAMC2  :  95498  54101  71731  62764  60630  64633  79476  77353  96356  70394 ... 49602 38868 37849 29147 54168 60389 48255 38729 31912 35696
APP    :  25604  18709  22107  46677  41708  47907  21038  19610  22815  44120 ... 39254 76234 72166 62192 33801 32410 28194 68815 56247 68841
KRT14  :  66673  41095  56114  81476  75289  80292  58457  51209  61860  76687 ... 39070 31069 27874 24329 38139 42721 34589 27887 22419 28122
 num [1:500, 1:36] 129671 73954 20807 95498 25604 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:500] "KRT6A" "EEF1A1" "PSAP" "LAMC2" ...
  ..$ : chr [1:36] "24h_DMSO-1" "24h_DMSO-2" "24h_DMSO-3" "24h_200nM17AAG-1" ...

2.3.1 Create a heatmap of the most variable genes¶

Since we're here, let's take a quick step back and look at the heatmap of our new dataset. Does it reveal anything to us now that it is more nuanced?

In [30]:
# Create a Heatmap object
wyler_hmap <- Heatmap(wyler_readcounts.mx,               # Supply our matrix 
                      cluster_rows = TRUE, cluster_columns = TRUE,   # Cluster on both rows and columns
                      col = viridis(100),
                      
                      # Use column_title as the title of our heatmap
                      column_title = "Heatmap of RNA-Seq readcounts on top 500 most variable genes",
                      
                      # Rotate the legend horizontally and give it a title
                      heatmap_legend_param = list(title = "readcounts per gene",
                                                  legend_direction = "vertical"),
                      
                      # Remove the row names
                      show_row_names = FALSE, 
                      
                      # Set the dendrogram sizes
                      row_dend_width = unit(40, "mm"),
                      column_dend_height = unit(20, "mm")
                      
                     )

# Plot the heatmap 
draw(wyler_hmap, 
     # Plot the legend on the bottom
     heatmap_legend_side = "left"
    )

2.3.2 Generate a k-means cluster analysis¶

Looks like the heatmap still doesn't reveal much, such as how the samples might be grouped. Let's proceed with k-means and see if that works better. We just need to repeat our steps on a different set of data. Since we now have our most variable genes and their readcounts, we'll see whether we can cluster this information.

  1. Generate an estimate on the appropriate number of clusters
  2. Create our k-means cluster object
  3. Visualize the k-means object and see which experiments tend to group together.
In [31]:
# How many clusters should we generate?
fviz_nbclust(t(wyler_readcounts.mx), kmeans, method="wss")
fviz_nbclust(t(wyler_readcounts.mx), kmeans, method="silhouette")
fviz_nbclust(t(wyler_readcounts.mx), kmeans, method="gap_stat")

From the 3 methods we see that

  1. Our WSS method produced an elbow around k = 4
  2. Our silhouette method peaks at k = 4
  3. Our gap_stat method sees diminishing progress also at k = 4

Let's proceed with that in mind!

In [32]:
# Compute k-means with k = 4

# Set a seed for reproducibility.
set.seed(123)

# Generate our k-means analysis
wyler_readcounts.km <- 
kmeans(scale(t(wyler_readcounts.mx)), # We'll scale our data for this (more on that later too!)
       centers = 4, 
       nstart = 25,
       iter.max = 500)

# Take a look at the resulting k-means object
str(wyler_readcounts.km)
List of 9
 $ cluster     : Named int [1:36] 3 3 3 4 4 4 3 3 3 4 ...
  ..- attr(*, "names")= chr [1:36] "24h_DMSO-1" "24h_DMSO-2" "24h_DMSO-3" "24h_200nM17AAG-1" ...
 $ centers     : num [1:4, 1:500] 0.378 -1.069 0.996 -0.233 -1.399 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "1" "2" "3" "4"
  .. ..$ : chr [1:500] "KRT6A" "EEF1A1" "PSAP" "LAMC2" ...
 $ totss       : num 17500
 $ withinss    : num [1:4] 1165 1661 1975 215
 $ tot.withinss: num 5016
 $ betweenss   : num 12484
 $ size        : int [1:4] 6 12 12 6
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
In [33]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=15)

# Plot the cluster
# You can treat this object like a ggplot object afterwards

wyler_kmeans.plot <-

    fviz_cluster(object = wyler_readcounts.km, # our k-means object
                 data = t(wyler_readcounts.mx), # Our original data needed for PCA to visualize
                 ellipse.type = "convex", 
                 ggtheme = theme_bw(), 
                 repel=TRUE, # Try to avoid overlapping text
                 labelsize = 20,
                 pointsize = 4,
                 main = "K-means clustering of filtered Wyler readcount data"
                 ) +

    # Set some ggplot theme information
    theme(text = element_text(size=20)) +

    # Set the colour and fill scheme to viridis
    scale_fill_viridis_d() +
    scale_colour_viridis_d()

# Print the plot!
wyler_kmeans.plot

2.3.3 Interpreting our RNA-Seq clustering¶

Looking at the result, what's nice about this kind of output is that we can immediately see the separation or clustering of our data. It looks like 4 was the way to go as there are 4 tight groupings from our data. It might be possible to further subdivide the group in the top left of our figure but it's unlikely.

Based on our selection of genes, we observe:

  1. 24-, 48-, and 72-hour samples that are mock-treated (DMSO) and mock-infected cluster together with 24-hour mock-treated samples infected with SARS-CoV-2.
  2. 48- and 72-hour samples mock-treated (DMSO) and infected by SARS-CoV-2 appear to share a similar profile.
  3. 24-hour samples treated with 200 nM 17AAG cluster together whether or not they are infected with SARS-CoV-2.
  4. 48- and 72-hour samples treated with 200 nM 17AAG cluster together whether or not they are infected with SARS-CoV-2.

These groupings suggest that the 24-hour SARS-CoV-2-infected timepoint is similar to uninfected controls, while the 48- and 72-hour SARS-CoV-2-infected samples resemble each other but likely show very distinct transcriptional changes due to infection. Treatment with the HSP90 inhibitor, however, appears to alter the expression profiles of infected and mock-infected cells enough to make them cluster together. This would be a good starting point to begin digging further into the data.

Skeptic or believer? While our above analysis seems to make sense to us, it lacks a deeper understanding of what might be happening. For one thing, we haven't even looked closely at the genes used to create our clustering. How many of them are relevant to the infection process? Is the clustering due to off-target effects of the HSP90 inhibitor (17AAG)? While clustering is an effective way to quickly identify if subgroups exist within your data, it's important to follow up these findings with in-depth analyses of the data.
Yes the data clusters, but why does it cluster?

3.0.0 Dimension reduction trims the (data) fat¶

One problem we encountered in our above analysis of the RNA-Seq data is that there were simply too many dimensions! With ~27k gene entries in our dataframe, using clustering to analyse the entire dataset would be memory-intensive, if not impossible altogether. To circumvent this problem we subset our data in two ways - filtering by read counts, and ranking by variance across datasets. Both approaches were a form of dimensionality reduction - namely, feature elimination. Our choices, however, may have inadvertently thrown away data that could prove insightful! This is where other dimensionality reduction methods can help guide our analyses.

You can achieve dimensionality reduction in three general ways:

  1. Feature elimination: you can reduce the feature space by removing unhelpful features. No information is gained, but you trim down your dataset size.
  2. Feature selection: on the reverse side, you can simply choose the features most important to you. How will you decide? Some schemes include ranking your features by importance, but this may suffer from information loss if the wrong features are chosen.
  3. Feature extraction: create new independent features which are combinations of the old features! The aim is to capture the essential features of your data in a more informationally dense form.
| Technique | Description | Type | Useful for |
| --- | --- | --- | --- |
| Random forest | A popular machine-learning classification algorithm that randomly chooses subsets of features to classify data; the features appearing most frequently across the forest are chosen. | Feature selection | Only takes numeric inputs |
| PCA | Principal component analysis attempts to maximize the variance retained when transforming to a lower-dimensional space. All new features are independent of one another. | Linear feature extraction | Actually really good at finding outliers in RNA-Seq data |
| t-SNE | t-Distributed Stochastic Neighbour Embedding is a non-linear technique suitable for high-dimension datasets. It converts neighbour distances into probabilities and minimizes the mismatch between those probabilities in the high- and low-dimension spaces. | Non-linear feature extraction | Determining if your data has underlying structure |
| UMAP | Uniform manifold approximation and projection is projection-based like t-SNE; it attempts to preserve local data structure while better preserving global data structure. | Non-linear feature extraction | Faster than t-SNE |

Since we're not building a classifier but rather trying to find trends in our data, we won't be looking at random forests here. Instead, we are interested in exploratory data analysis: we have a dataset we want to understand, and sometimes it is too complex to simply divine 1) the underlying structure and 2) the features that drive that structure.
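As a toy illustration of elimination versus extraction, here is a base-R sketch on a made-up matrix (the matrix, feature names, and variance threshold are all invented for this example):

```r
# Invented data: 100 observations of 10 features, one nearly constant
set.seed(42)
m <- matrix(rnorm(100 * 10), nrow = 100,
            dimnames = list(NULL, paste0("feat", 1:10)))
m[, "feat1"] <- 0.001 * m[, "feat1"]   # shrink feat1 to near-zero variance

# Feature elimination: drop the lowest-variance features outright
keep <- apply(m, 2, var) > 0.01
m_elim <- m[, keep]

# Feature extraction: replace all 10 features with 2 brand-new ones (PCs)
m_extr <- prcomp(m, scale. = TRUE)$x[, 1:2]

dim(m_elim)   # 100 x 9: one feature removed, the rest untouched
dim(m_extr)   # 100 x 2: new features combining information from all 10
```

Note the trade-off: elimination keeps features interpretable but discards whole columns, while extraction keeps information from every column at the cost of interpretability.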


3.1.0 Reduce your dimensionality with PCA¶

Going back to our PHU data, we used 7 dimensions to classify our data, but what if we wanted to transform that information to reduce the number of dimensions needed to represent the same relationships? For our dataset, 7 dimensions isn't a lot of features, but in other cases (like RNA-Seq) you might encounter feature-dense datasets. Without knowing a priori which features are meaningful, PCA provides a path forward in classifying our data.

Now there are caveats to PCA. It performs linear feature extraction by building new features in your n-dimensional space such that each new feature (think of it as a fitted line) captures the maximum possible variance of the original data points. Each new component must be uncorrelated with (i.e., perpendicular to) the previous ones while accounting for the next highest amount of variance. More resources on how this works are in the references.

How do we maximize the variance (spread of the red dots) captured by a new feature? The direction that maximizes the variance also minimizes the error (red line lengths). Generated by user Amoeba on stackexchange.com

All math aside, our goal is to reduce our feature set to something smaller by trying to represent our data with these new features. Just remember that highly variable data and outliers can dominate your principal components.

For simplicity let's head back and look at our PHU age group data again. To illustrate our example with PCA, let's use the original data normalized by PHU population.

In [34]:
head(covid_demographics_norm.mx)
A matrix: 6 × 8 of type dbl
                                           population_2022  0 to 4  5 to 11  12 to 19  20 to 39  40 to 59  60 to 79   80+
Algoma                                              117840    3302     4890      4464      7570      4301      2331  4116
Brant County                                        153558    4481     5376      6499      9448      6429      3995  5684
Durham Region                                       711426    4669     5697      6407     11018      7333      4985  7562
Grey Bruce                                          176150    2324     3040      4643      5995      3489      2022  3056
Haldimand-Norfolk                                   120008    3002     4956      5388      9803      5799      3568  6261
Haliburton, Kawartha, Pine Ridge District           190728    2263     2865      3673      7368      3681      1923  3612

3.2.0 Use PCA() to generate our analysis¶

To generate our analysis of the PHU data, we'll use the FactoMineR function PCA() for which there are some parameters we'll be using that we should discuss:

  • X a data frame of n rows (observations) and p columns (numeric variables)
  • ncp the number of dimensions kept in the results (5 is the default)
  • scale.unit a boolean to scale your data (TRUE by default)

3.2.1 What does it mean to scale our data?¶

Remember that PCA is trying to maximize the variance captured along each principal component across all of the observations supplied. Depending on the nature of your variables, you may have, for instance, two different unit types like height and mass. Smaller changes in height may be matched with much larger changes in mass, or just wider overall variance. This may lead the PCA algorithm to prioritize mass over height when you'd prefer they have equal importance. By centering each variable on its mean and scaling to unit variance, everything is compared as a z-score, bringing the overall variance of each variable to within ~3 standard deviations.
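Here is a quick base-R sketch of that idea using `scale()` on some invented height/mass numbers:

```r
# Two features on very different scales (numbers invented for illustration)
height_cm <- c(150, 160, 170, 180, 190)
mass_kg   <- c(50, 65, 90, 120, 160)
raw <- cbind(height_cm, mass_kg)

apply(raw, 2, sd)    # mass spreads far wider than height

# scale() centres each column on its mean and divides by its sd (z-scores)
z <- scale(raw)
round(colMeans(z), 10)        # both means are now 0
round(apply(z, 2, sd), 10)    # both standard deviations are now 1
```

This is exactly what `scale.unit = TRUE` asks `PCA()` to do internally before any components are computed.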

Let's compare PCA with and without scaling shall we?

In [35]:
# Build a PCA of our PHU data with scaling applied
phu_scaled.pca <- PCA(covid_demographics_norm.mx[, -1], 
                      scale.unit = TRUE, # What happens when we don't scale the data?
                      ncp = 7,
                      graph = TRUE)

# Build a PCA of our PHU data WITHOUT scaling applied
phu_unscaled.pca <- PCA(covid_demographics_norm.mx[,-1], 
                        scale.unit = FALSE,
                        ncp = 7,
                        graph = TRUE)
Warning message:
"ggrepel: 9 unlabeled data points (too many overlaps). Consider increasing max.overlaps"
Warning message:
"ggrepel: 17 unlabeled data points (too many overlaps). Consider increasing max.overlaps"
In [36]:
# Take a look at the information inside our PCA object
print(phu_scaled.pca)
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 34 individuals, described by 7 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"          

3.2.2 Our PCA object holds many pieces of data¶

Regardless of scaling, the result of our PCA() call produces an object with many variables we can access. Above you can see a brief description for each variable but we are most interested in a few particular ones:

  • eig holds our eigenvalues and describes just how much each new principal component explains of the overall variation in our data.
  • var holds the results of all the variables. We can use these to graph and visualize our data.
  • ind$coord will allow us to plot the coordinates of our observations along the principal components.

3.3.0 Use the eigenvalues to determine the percent variance of each component¶

The eigenvalues from our analysis pair with the eigenvectors (principal components) to transform our data from the original feature set to the new one. While the eigenvectors determine the directions of the new feature space, the eigenvalues represent the magnitudes of those vectors and, in this case, can be used to calculate the percent of overall variance explained by each eigenvector.

The important take-away is that we can now see just how much of our variance is explained by each new principal component. We can access this information directly from phu_scaled.pca or by using the function get_eigenvalue(). We can also plot this as a bar chart using fviz_eig().
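As a hedged base-R aside (using `prcomp()` on the built-in `USArrests` data rather than our PHU object), the percent-variance column is just each eigenvalue divided by the sum of all eigenvalues:

```r
# Scaled PCA with base R; the eigenvalues are the squared component sdevs
pca <- prcomp(USArrests, scale. = TRUE)
eig <- pca$sdev^2

pct <- 100 * eig / sum(eig)   # percent of total variance per component

data.frame(eigenvalue = round(eig, 4),
           percent    = round(pct, 4),
           cumulative = round(cumsum(pct), 4))
```

The same three columns are what `get_eigenvalue()` reports for a FactoMineR PCA object.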

In [37]:
# Look at our eigenvalues directly
phu_scaled.pca$eig

# Use get_eigenvalue() to look at our eigenvalues
get_eigenvalue(phu_scaled.pca)
A matrix: 7 × 3 of type dbl
        eigenvalue  percentage of variance  cumulative percentage of variance
comp 1  5.57079932              79.5828474                           79.58285
comp 2  0.89375555              12.7679364                           92.35078
comp 3  0.26020469               3.7172099                           96.06799
comp 4  0.12869297               1.8384710                           97.90646
comp 5  0.08966823               1.2809747                           99.18744
comp 6  0.04430641               0.6329488                           99.82039
comp 7  0.01257282               0.1796117                          100.00000
A matrix: 7 × 3 of type dbl
       eigenvalue  variance.percent  cumulative.variance.percent
Dim.1  5.57079932        79.5828474                     79.58285
Dim.2  0.89375555        12.7679364                     92.35078
Dim.3  0.26020469         3.7172099                     96.06799
Dim.4  0.12869297         1.8384710                     97.90646
Dim.5  0.08966823         1.2809747                     99.18744
Dim.6  0.04430641         0.6329488                     99.82039
Dim.7  0.01257282         0.1796117                    100.00000

3.3.1 Build a scree plot to look at the effect of your dimensions¶

See how nearly 80% of our variation is explained by just our first dimension? Let's use fviz_eig() to display this information visually in what is known as a scree plot. It's essentially a barplot/lineplot combo, but what we're interested in is following the line, much like our cluster-estimating WSS method: we use a "sharp elbow" to determine how many principal components we need.

In [38]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=10, repr.plot.height=6)

# Visualize the impact of our eigenvalues
fviz_eig(phu_scaled.pca, addlabels = TRUE) + theme(text = element_text(size=20))

3.3.2 Most of our variance is accounted for in the first two principal components!¶

Looking at our eigenvalues we can see that even though 7 new PCs were generated, the first two explain almost 92% of our variance. Whatever linear separation exists in our data, we can recreate it in a two-dimensional projection of coordinates along PC1 and PC2. In other words, we can recreate the underlying structure of our data with these two new features instead of a 7-dimensional space!

This makes some sense when we take a closer look at the data. For instance, there isn't much visual information found in our 0 to 4 age range. The small distribution of information in these features may not do much, in the grand scheme, to change how the PHUs are separated from each other. Then again, perhaps they do contain information that we simply cannot see easily! How will we know?


3.4.0 Investigate your variables with get_pca_var()¶

Within our PCA, we can access information regarding how our original variables are transformed into the new space. This is all stored in the var element of our PCA object. We can extract its pieces using get_pca_var() and visualize them to assess the quality of our variables and how well they are represented by the new principal components.

In [39]:
# What is the information associated with our original variables
phu.var <- get_pca_var(phu_scaled.pca)

phu.var
Principal Component Analysis Results for variables
 ===================================================
  Name       Description                                    
1 "$coord"   "Coordinates for the variables"                
2 "$cor"     "Correlations between variables and dimensions"
3 "$cos2"    "Cos2 for the variables"                       
4 "$contrib" "contributions of the variables"               

3.4.1 What is the contribution of each variable to the new components?¶

Each original variable may, to some degree, influence the amount of information captured by each new principal component. Looking at the breakdown of contributions can also reveal, in some cases, where more or less of the variation is coming from.

In [40]:
# What is the contribution of each variable?
phu.var$contrib
A matrix: 7 × 7 of type dbl
             Dim.1       Dim.2        Dim.3      Dim.4         Dim.5       Dim.6       Dim.7
0 to 4    10.34590  40.9538402   6.80231618  13.164244  2.341191e+01   4.1598042   1.1619876
5 to 11   13.92950  15.5247689  13.94117607   6.128455  4.187471e+01   7.7830343   0.8183575
12 to 19  15.22164   2.5003608  22.64386889  46.614495  1.175802e+01   0.2326117   1.0290033
20 to 39  16.20731   0.2393659  20.00827009   8.205234  1.759244e+01  37.3147126   0.4326714
40 to 59  16.52966   5.7071663   2.79224811   4.688363  6.551208e-04  18.9749394  51.3069659
60 to 79  15.84305   9.6533637   0.06091353  12.485206  3.162872e+00  14.7586453  44.0359452
80+       11.92294  25.4211341  33.75120714   8.714002  2.199400e+00  16.7762526   1.2150691

From above we can see that our original variables contribute roughly equally to our first principal component, but there is much more contribution to PC2 from three variables: 0 to 4, 5 to 11 and 80+. Combined with our previous scree plot, that means the 12.8% of overall variation described by PC2 comes mainly from these three groups.

How many dimensions do I get, and how many should I keep? It is at this point that we should discuss the maximum number of possible dimensions. Given n observations and p features, the maximum number of dimensions after reduction is min(n - 1, p). As for how many dimensions to keep, a general rule of thumb is that the principal components you choose should account for at least 80% of your variation. Keep that in mind!
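That rule of thumb is easy to sketch in base R (shown on the built-in `USArrests` data; `n_components_for` is an invented helper name, not part of any package):

```r
# Smallest number of PCs whose cumulative variance reaches the target
n_components_for <- function(x, target = 0.80) {
  eig <- prcomp(x, scale. = TRUE)$sdev^2
  which(cumsum(eig) / sum(eig) >= target)[1]
}

n_components_for(USArrests)                 # 2 PCs reach 80% here
min(nrow(USArrests) - 1, ncol(USArrests))   # the min(n - 1, p) ceiling: 4
```

For our PHU data the same logic would be applied to the eigenvalues in phu_scaled.pca$eig.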

3.4.2 Checking the quality and representation of your variables in the PCA with coord and cor¶

From the remaining variable information, coord and cor can both be used to plot our original variables along different pairs of principal components. These values are identical here because this is not a projection of our variables onto the PCs but their correlation with them! The distance between the origin and a variable's coordinates gauges the quality of that variable's representation on the map. This quantity is summarized in the cos2 (squared cosine) element.
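You can verify the coord/cor/cos2 relationships yourself in base R (sketched with `prcomp()` on `USArrests`; FactoMineR stores the analogous quantities in `$var`):

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings scaled by each component's sdev, which for
# scaled data equal the correlations between variables and components
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)
all.equal(coord, cor(scale(USArrests), pca$x), check.attributes = FALSE)

# cos2 is simply the squared coordinate, and for each variable it sums
# to 1 across ALL components (perfect representation by the full set)
cos2 <- coord^2
rowSums(cos2)
```

This is why a variable plotted near the unit circle on a PC1/PC2 map is "high quality": most of its cos2 budget is spent on those two components.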

In [41]:
# Look at the coordinates of our variables across the PCs.
phu.var$coord
phu.var$cor
A matrix: 7 × 7 of type dbl
              Dim.1        Dim.2        Dim.3        Dim.4          Dim.5        Dim.6         Dim.7
0 to 4    0.7591766   0.60500184   0.13304114  -0.13015935   0.1448897723   0.04293088   0.012086961
5 to 11   0.8808998   0.37249629   0.19046153   0.08880817  -0.1937738660  -0.05872294  -0.010143502
12 to 19  0.9208514   0.14948951  -0.24273527   0.24492770   0.1026801084  -0.01015194  -0.011374302
20 to 39  0.9501982  -0.04625307  -0.22817199  -0.10275972  -0.1255978821   0.12857998  -0.007375568
40 to 59  0.9596011  -0.22584976  -0.08523826  -0.07767621  -0.0007664433  -0.09169032   0.080316458
60 to 79  0.9394598  -0.29373027   0.01258967  -0.12675797   0.0532549688  -0.08086425  -0.074408070
80+       0.8149864  -0.47665795   0.29634815   0.10589763   0.0444090428   0.08621459   0.012359954
A matrix: 7 × 7 of type dbl
              Dim.1        Dim.2        Dim.3        Dim.4          Dim.5        Dim.6         Dim.7
0 to 4    0.7591766   0.60500184   0.13304114  -0.13015935   0.1448897723   0.04293088   0.012086961
5 to 11   0.8808998   0.37249629   0.19046153   0.08880817  -0.1937738660  -0.05872294  -0.010143502
12 to 19  0.9208514   0.14948951  -0.24273527   0.24492770   0.1026801084  -0.01015194  -0.011374302
20 to 39  0.9501982  -0.04625307  -0.22817199  -0.10275972  -0.1255978821   0.12857998  -0.007375568
40 to 59  0.9596011  -0.22584976  -0.08523826  -0.07767621  -0.0007664433  -0.09169032   0.080316458
60 to 79  0.9394598  -0.29373027   0.01258967  -0.12675797   0.0532549688  -0.08086425  -0.074408070
80+       0.8149864  -0.47665795   0.29634815   0.10589763   0.0444090428   0.08621459   0.012359954
In [42]:
# What is the quality of our variables?
# The higher the value the better!
phu.var$cos2
A matrix: 7 × 7 of type dbl
              Dim.1        Dim.2         Dim.3        Dim.4         Dim.5         Dim.6         Dim.7
0 to 4    0.5763492  0.366027221  0.0176999458  0.016941457  2.099305e-02  0.0018430600  1.460946e-04
5 to 11   0.7759844  0.138753484  0.0362755942  0.007886891  3.754831e-02  0.0034483833  1.028906e-04
12 to 19  0.8479673  0.022347113  0.0589204091  0.059989580  1.054320e-02  0.0001030619  1.293748e-04
20 to 39  0.9028766  0.002139346  0.0520624574  0.010559560  1.577483e-02  0.0165328107  5.439901e-05
40 to 59  0.9208343  0.051008116  0.0072655606  0.006033594  5.874353e-07  0.0084071150  6.450733e-03
60 to 79  0.8825848  0.086277474  0.0001584999  0.016067582  2.836092e-03  0.0065390264  5.536561e-03
80+       0.6642028  0.227202798  0.0878222244  0.011214308  1.972163e-03  0.0074329558  1.527685e-04

3.4.3 Plot your variable information on a unit circle with fviz_pca_var()¶

To capture all of our information in a single visualization, we can use fviz_pca_var() to assess our variables. High-quality variables will be closer to the circumference of the circle; low-quality correlations will sit closer to the origin. As we saw above, most of our variables are positively correlated with PC1, and across PC1/PC2 our variables are represented quite well.

In [43]:
# Compare how our variables contribute and correlate with PC1/PC2
fviz_pca_var(phu_scaled.pca, 
             col.var = "contrib", # How will we colour our data/lines
             gradient.cols = c("green", "yellow", "red"), 
             labelsize = 10,
             repel = TRUE, # make sure text doesn't overlap
             axes = c(1,2) # Determine which PCs you want to graph
            ) + 
    theme(text = element_text(size=20))
In [44]:
# Compare how our variables contribute and correlate with PC2/PC3
fviz_pca_var(phu_scaled.pca, 
             col.var = "contrib", # How will we colour our data/lines
             gradient.cols = c("green", "yellow", "red"), 
             labelsize = 10,
             repel = TRUE, # make sure text doesn't overlap
             axes = c(2,3) # Determine which PCs you want to graph
            ) + 
    theme(text = element_text(size=20))

3.5.0 Check if PHUs cluster or separate with fviz_pca_ind()¶

Now that we've generated our PCA, let's see if there is any clear separation between our PHUs across the first two dimensions of new features. Much like our variables, all of the individual coordinate data can be found in our ind element of our PCA object. Within there we'll find the coord information so we could directly build a ggplot with phu_scaled.pca$ind$coord BUT there's a simpler way.

We'll parse through the PCA object with fviz_pca_ind() to plot our data along the first two principal components. We can define pointsize and col.ind to help add some population size information to our PHUs.
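For reference, the manual route that `fviz_pca_ind()` wraps looks roughly like this (sketched with `prcomp()` on the built-in `USArrests` data, since the lecture's PCA object lives in your notebook; with FactoMineR you would pull from `phu_scaled.pca$ind$coord` instead of `pca$x`):

```r
library(ggplot2)

# Extract the individual coordinates and plot PC1 vs PC2 ourselves
pca <- prcomp(USArrests, scale. = TRUE)
coords <- as.data.frame(pca$x[, 1:2])
coords$label <- rownames(coords)

ggplot(coords, aes(x = PC1, y = PC2, label = label)) +
    geom_point(size = 3) +
    geom_text(vjust = -0.8) +
    theme_bw()
```

The convenience functions simply save you this bookkeeping and add niceties like repelled labels and contribution colouring.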

In [45]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=12)

# Graph our scaled PCA data.

fviz_pca_ind(phu_scaled.pca, 
             pointsize = covid_demographics_norm.mx[,1], # Set point size and colour by PHU population
             col.ind = log10(covid_demographics_norm.mx[,1]),
             repel = TRUE, # avoid overlapping text points
             labelsize = 7, 
             axes = c(1,2)
            ) + 
    
    theme(text = element_text(size=20)) + # Make our text larger

    scale_size(range = c(3, 10)) + # Update the point size range

    scale_colour_viridis_c() # Change the colour scheme
In [46]:
# Graph our UNscaled PCA data.

fviz_pca_ind(phu_unscaled.pca, 
             pointsize = covid_demographics_norm.mx[,1], # Set point size and colour by PHU population
             col.ind = log10(covid_demographics_norm.mx[,1]),
             repel = TRUE, # avoid overlapping text points
             labelsize = 7, 
             axes = c(1,2)
            ) + 
    
    theme(text = element_text(size=20)) + # Make our text larger

    scale_size(range = c(3, 10)) + # Update the point size range

    scale_colour_viridis_c() # Change the colour scheme
In [47]:
# What are the contributions of each age group to the various unscaled dimensions
phu_unscaled.pca$var$contrib
A matrix: 7 × 7 of type dbl
              Dim.1       Dim.2       Dim.3        Dim.4       Dim.5        Dim.6        Dim.7
0 to 4     2.988366  18.4090140  24.4647380   7.30843762   7.7732975  35.27429361   3.78185308
5 to 11    5.820004   9.8367326  29.0219186   3.15138188   5.5695018  44.23345570   2.36700587
12 to 19   9.644284   9.4830079   0.2183356  78.72062273   0.1265606   0.52509861   1.28209108
20 to 39  31.169813   8.3545754  26.3862589   7.05958206  24.0205061   2.96785850   0.04140577
40 to 59  18.763509   0.5752396   4.0847938   0.08899356  27.6082501  12.92675802  35.95245607
60 to 79  11.718808   2.7443694   0.7180184   3.05207351  26.2061332   0.02187535  55.53872184
80+       19.895216  50.5970612  15.1059367   0.61890865   8.6957506   4.05066020   1.03646629
In [48]:
# Look at our unscaled eigenvalues directly
phu_unscaled.pca$eig
A matrix: 7 × 3 of type dbl
         eigenvalue  percentage of variance  cumulative percentage of variance
comp 1  18984392.38              83.2951059                           83.29511
comp 2   2249160.27               9.8683191                           93.16343
comp 3    855190.77               3.7521983                           96.91562
comp 4    318216.08               1.3961912                           98.31181
comp 5    226908.94               0.9955759                           99.30739
comp 6    121882.36               0.5347658                           99.84216
comp 7     35975.29               0.1578437                          100.00000

3.5.1 Scaled vs unscaled data¶

As you can see from graphing both PCA sets, when we choose not to scale our data something a little more drastic happens. PC1's share increases to 83.3% of the overall variance. Many of our points move, and the separation between some of our PHUs increases (Kingston/Frontenac and Peel). These changes are likely driven by the larger range of raw values in the 20 to 39 age group, which now contributes 31% of PC1's variation.

So think carefully about your features and what they represent. Depending on their ranges and units, you may be inadvertently weighting them more heavily than your other features! Scaling adjusts your data so that all features are weighted equally!
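A tiny base-R demonstration of that weighting effect, with two invented features whose only real difference is their scale:

```r
set.seed(1)
small <- rnorm(50, sd = 1)      # e.g. a proportion-like feature
big   <- rnorm(50, sd = 1000)   # e.g. raw population counts
x <- cbind(small, big)

unscaled <- prcomp(x, scale. = FALSE)
scaled   <- prcomp(x, scale. = TRUE)

# Share of total variance landing on PC1:
unscaled$sdev[1]^2 / sum(unscaled$sdev^2)   # ~1: 'big' hijacks PC1
scaled$sdev[1]^2 / sum(scaled$sdev^2)       # ~0.5: equal weighting restored
```

With no scaling, PC1 is effectively just the `big` feature; after scaling, both features get an equal say.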

3.6.0 Can we cluster our transformed data?¶

Now that we've effectively reduced the complexity of our data, can we produce the same kmeans clustering as before? Let's try to cluster just on the two dimensions presented, which we have already graphed!

In [49]:
# How many clusters should we use based on our PCA data?
fviz_nbclust(phu_scaled.pca$ind$coord[,1:2], kmeans, method="wss")
fviz_nbclust(phu_scaled.pca$ind$coord[,1:2], kmeans, method="silhouette")
In [50]:
# Generate our kmeans analysis and visualize it

set.seed(123)

phu_PCA_kmeans.plot <-

    kmeans(phu_scaled.pca$ind$coord[, 1:2], centers = 3) %>% 

    fviz_cluster(., # our k-means object
                 data = phu_scaled.pca$ind$coord[, 1:2], # Our original data needed for PCA to visualize
                 ellipse.type = "convex", 
                 ggtheme = theme_bw(), 
                 repel=TRUE, # Try to avoid overlapping text
                 labelsize = 20,
                 pointsize = 4,
                 main = "K-means clustering of PHU by normalized case number AFTER Principal Component Analysis"
                 ) +

    # Set some ggplot theme information
    theme(text = element_text(size=20)) +

    # Set the colour and fill scheme to viridis
    scale_fill_viridis_d() +
    scale_colour_viridis_d()

# phu_PCA_kmeans.plot
In [51]:
# Plot both of our figures together
# Note: fig.show, out.width and out.height are knitr chunk options; they have no effect in a Jupyter cell
phu_kmeans.plot
phu_PCA_kmeans.plot

3.6.1 Was it worth it to use PCA?¶

From our estimations, the clustering choices look nearly identical. This is not the best dataset to show off the technique, since it is rather small, but you can see that we have now transformed our data into just 2 features! Could we do the same using just 2 features from our original dataset? No! So imagine transforming a much larger and feature-heavy dataset into a small number of features. From the PCA, we are also able to discern which of our original features really contribute to each new principal component!

Something to consider the next time you're working with a large data set!


3.7.0 PCA on a large RNA-Seq dataset¶

Now that we've walked through how to generate a PCA for a simpler dataset, let's return to our RNA-Seq readcount data. We won't trim down the data at all but rather just provide the entire dataset as a matrix for feature reduction. Let's review the steps:

  1. Generate a matrix of your data, naming the columns and rows by whatever information you might have. Make sure your columns represent your features, and rows are observations.
  2. Create the PCA object.
  3. Review your eigenvalues and eigenvectors to assess how well your reduction worked. Determine how many dimensions you wish to use in further analysis.
  4. Calculate the number of possible clusters.
  5. Plot the results.
In [52]:
# 1. Generate our matrix from the readcount data
wyler_readcounts_all.mx <-

    wyler_readcounts.df %>% 
    # Rename the columns by removing the first portion: AECII_xx
    rename_with(., ~ str_replace(string = .x, 
                                 pattern = r"(\w*_\d*_)", 
                                 replacement = "")) %>% 

    # Keep only the experimental observations (how many are there?)
    select(c(3:38)) %>% 

    # Convert to a matrix
    as.matrix() %>% 

    # Transpose the matrix so our columns are now genes
    t()

# Name the columns of the matrix by gene
colnames(wyler_readcounts_all.mx) <- wyler_readcounts.df$gene

head(wyler_readcounts_all.mx)
A matrix: 6 × 27011 of type dbl
A1BGA1BG-AS1A1CFA2MA2M-AS1A2ML1A2MP1A3GALT2A4GALTA4GNT...ZWILCHZWINTZXDAZXDBZXDCZYG11AZYG11BZYXZZEF1ZZZ3
24h_DMSO-122550002721007940...154111119358363263035351126486
24h_DMSO-219200102349005450...118 76 9027432615062138 841337
24h_DMSO-315380022683006860...125 94 8932831205682795 980400
24h_200nM17AAG-129110024740008820... 32 75272860354266215981012502
24h_200nM17AAG-227150003914007120... 45 7627084028316301314 911409
24h_200nM17AAG-330170024220008200... 41 76263850315165915201015516
In [53]:
# 2. Generate the PCA object
readcounts_scaled.pca <- PCA(wyler_readcounts_all.mx, 
                             scale.unit = TRUE, # What happens when we don't scale the data?
                             ncp = 40,
                             graph = TRUE)
In [54]:
# 3. Review how much variance is explained by the new components
readcounts_scaled.pca$eig
A matrix: 35 × 3 of type dbl
         eigenvalue  percentage of variance  cumulative percentage of variance
comp 1    5806.7193              26.1034806                           26.10348
comp 2    3758.6649              16.8966729                           43.00015
comp 3    2142.1606               9.6298519                           52.63001
comp 4    1712.0983               7.6965535                           60.32656
comp 5     946.6334               4.2554883                           64.58205
comp 6     658.5698               2.9605295                           67.54258
comp 7     529.6958               2.3811906                           69.92377
comp 8     384.6916               1.7293398                           71.65311
comp 9     329.8880               1.4829760                           73.13608
comp 10    315.1489               1.4167177                           74.55280
comp 11    303.3286               1.3635810                           75.91638
comp 12    300.6329               1.3514629                           77.26784
comp 13    284.9892               1.2811383                           78.54898
comp 14    275.5697               1.2387937                           79.78778
comp 15    268.1825               1.2055854                           80.99336
comp 16    261.4964               1.1755291                           82.16889
comp 17    257.5704               1.1578801                           83.32677
comp 18    252.9920               1.1372984                           84.46407
comp 19    240.7963               1.0824740                           85.54654
comp 20    235.2360               1.0574779                           86.60402
comp 21    230.5237               1.0362945                           87.64032
comp 22    226.9867               1.0203942                           88.66071
comp 23    222.2371               0.9990429                           89.65975
comp 24    214.7536               0.9654017                           90.62515
comp 25    212.5614               0.9555469                           91.58070
comp 26    207.7479               0.9339084                           92.51461
comp 27    199.1293               0.8951644                           93.40977
comp 28    198.1217               0.8906346                           94.30041
comp 29    193.6473               0.8705206                           95.17093
comp 30    189.7357               0.8529363                           96.02387
comp 31    181.8398               0.8174410                           96.84131
comp 32    180.6928               0.8122851                           97.65359
comp 33    177.0252               0.7957976                           98.44939
comp 34    175.3613               0.7883177                           99.23771
comp 35    169.5720               0.7622925                          100.00000
In [ ]:
# 4. How many clusters should we use based on our PCA data?
fviz_nbclust(readcounts_scaled.pca$ind$coord[,1:2], kmeans, method="silhouette")
In [ ]:
# 5. Generate our kmeans analysis and visualize it
set.seed(123)

readcounts_PCA_kmeans.plot <-

    # Let's go with the suggested silhouette number of 5 clusters
    kmeans(readcounts_scaled.pca$ind$coord[,1:2], centers = ...) %>% 

    fviz_cluster(., # our k-means object
                 data = readcounts_scaled.pca$ind$coord[,1:2], # Our original data needed for PCA to visualize
                 ellipse.type = "convex", 
                 ggtheme = theme_bw(), 
                 repel=TRUE, # Try to avoid overlapping text
                 labelsize = 20,
                 pointsize = 4,
                 main = "K-means clustering of RNA-Seq samples AFTER Principal Component Analysis"
                 ) +

    # Set some ggplot theme information
    theme(text = element_text(size=20)) +

    # Set the colour and fill scheme to viridis
    scale_fill_viridis_d() +
    scale_colour_viridis_d()

# Look at our plot
readcounts_PCA_kmeans.plot

4.0.0 Non-linear projection¶

How do we identify trends or groups within deeply complex data in an unsupervised manner?

So we just spent most of our time learning how to transform our datasets based on the largest sources of variation in the data. We've discussed some limitations, but in some cases, depending on your dataset, you may be more interested in finding small, local similarities than in the large-scale variation that PCA captures.

There are two popular projections we'll discuss today, but first let's put together a more complex dataset based on some RNA-Seq data in GSE150316_DeseqNormCounts_final.txt and its companion file 2020.07.30.20165241-supp_tables.xlsx

In [ ]:
# Read in our RNAseq data
tissue_data.df <- ...(file = ...,
                             header = TRUE,
                             row.names = 1)
# Take a quick look at it
head(tissue_data.df)
dim(tissue_data.df)

# Read in some additional patient data
patient_data.df <- read_excel("./data/2020.07.30.20165241-supp_tables.xlsx", sheet=2)

# Take a quick look at it
head(patient_data.df)
dim(patient_data.df)

4.0.1 Reformat our patient meta data¶

First let's examine patient_data.df, which lists 24 patient samples (aka cases) in total. These 24 cases relate back to tissue_data.df, whose variables represent different combinations of case and tissue type.

At this point we want to reformat the column names in patient_data.df a bit before selecting for just the viral load and viral load percentage information located in the 2nd and 3rd column. We'll hold onto this in patient_viral_load.df for later use.

In [ ]:
# Create a dataframe holding just the viral_load information for each patient
patient_viral_load.df <-
    patient_data.df %>% 
  
    # Retain just the first 3 columns
    select(1:3) %>% 

    # Rename the 2nd and 3rd columns
    ...(case_num = 1, viral_load = 2, viral_load_percent = 3)

head(patient_viral_load.df)

4.0.2 Prepare our RNAseq tissue data for analysis by eliminating features¶

We now want to format tissue_data.df much as we did our PHU data. Our current data lists genes as observations and tissue samples as columns; essentially, we'd like to transpose it. We could do that directly, but transposition converts everything to a matrix, and in the end we want to work with a data frame so we can hold more information.

To reduce memory use and runtime, however, we should trim our dataset. We aren't really interested in looking at all 59,090 genes - many of which may be barely expressed. Since these are normalized counts across the dataset, we can filter out low-expression genes to make tissue_data_filtered.df. Yes, this would again be considered a form of feature elimination. In general we will:

  1. Convert the row names to their own column.
  2. Calculate the mean of the normalized counts across each observation and filter for a minimum of 0.5.
  3. Calculate the variance across each observation as a backup plan.
In [ ]:
# Trim the tissue data down
tissue_data_filtered.df <-

    tissue_data.df %>% 

    # Convert the row names to a column
    rownames_to_column(var="gene") %>% 

    # Set up the table to perform row-wise operations
    rowwise() %>% 

    # Calculate the mean expression of each gene across all tissue samples
    mutate(mean = ...(c_across(where(is.numeric)))) %>% 

    # Filter for samples with low expression
    filter(mean > 0.5) %>% 

    # Calculate overall variance in case we need to make our dataset smaller
    mutate(variance = ...(c_across(where(is.numeric)))) %>% 

    # Arrange samples by descending variance
    arrange(desc(variance)) %>% 

    # Remove the grouping specification
    ungroup()

# Take a look at the final results
head(tissue_data_filtered.df)

# how big is our filtered data frame?
dim(tissue_data_filtered.df)
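One plausible completion of the two blanks above is mean() and var(). A self-contained sketch on a toy gene-by-sample data frame (note that, as in the cell above, the second c_across(where(is.numeric)) also picks up the freshly created mean column):

```r
library(dplyr)
library(tibble)

# Toy stand-in for tissue_data.df: 3 "genes" x 4 "samples"
demo.df <- data.frame(s1 = c(0.1, 5, 2), s2 = c(0.2, 6, 3),
                      s3 = c(0.1, 4, 2), s4 = c(0.3, 7, 1),
                      row.names = c("geneA", "geneB", "geneC"))

demo.df %>%
    rownames_to_column(var = "gene") %>%
    rowwise() %>%
    # mean() would fill the first blank
    mutate(mean = mean(c_across(where(is.numeric)))) %>%
    filter(mean > 0.5) %>%
    # var() would fill the second blank
    mutate(variance = var(c_across(where(is.numeric)))) %>%
    arrange(desc(variance)) %>%
    ungroup()
```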

4.0.3 More data wrangling to transpose our RNAseq data¶

Now that we've filtered our data down to ~29k genes, we'll run through three more steps:

  1. We'll transpose our data using a combination of pivot_longer() and pivot_wider().
  2. We'll use a series of string matches and joins to add some sample information to our data. It won't be used for our analysis but will be used when we plot the results!
  3. Lastly, we'll convert the relevant parts of our data frame to a matrix. There are huge memory savings in the algorithm when working with a matrix, and you have limited memory on your Jupyter Hubs!
In [ ]:
# You need to transpose the data. 
# We can do it with dplyr to keep it as a data frame and to add some info

tissue_RNAseq.df <-

    tissue_data_filtered.df %>% 
    
    # trim down the columns to drop mean/variance
    select(1:89) %>% 

    # pivot longer
    pivot_longer(cols=c(2:89), names_to = "sample", values_to = ...) %>% 

    # redistribute the gene names to columns
    pivot_wider(names_from = gene, values_from = ...)
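The two blanks above name the values column, which is our choice; "expression" is a hypothetical name. A toy round trip showing how the two pivots transpose the table:

```r
library(dplyr)
library(tidyr)

# Toy gene-by-sample data frame (hypothetical values)
demo.df <- data.frame(gene = c("geneA", "geneB"),
                      s1 = c(1, 2), s2 = c(3, 4))

# pivot_longer() then pivot_wider() effectively transposes the table;
# "expression" is an assumed column name, not from the lecture data
demo.df %>%
    pivot_longer(cols = 2:3, names_to = "sample", values_to = "expression") %>%
    pivot_wider(names_from = gene, values_from = expression)
```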
In [ ]:
# We want to add some additional sample information before assessing the data

tissue_RNAseq.df <-

    tissue_RNAseq.df %>% 

    # Grab just the sample names
    select(sample) %>% 

    # Grab information from it like case number, tissue, and tissue number
    # takes the form of "caseX.tissueY" or "caseX.tissue.NYC" or "NegControlX"
    # Remember that this returns a LIST of character matrices
    str_match_all(., pattern=r"(case([\w]+)\.([a-z]+)([\d|\.NYC]*)|(NegControl\d))") %>% 

    # Bind all the matrices from all the list elements together in a single object (likely a matrix)
    do.call(rbind, .) %>% 

    # Convert the results to a data frame
    as.data.frame() %>% 

    # Rename the columns based on the capture groups
    rename(., sample = V1, case_num = V2, tissue = V3, tissue_num = V4, neg_num = V5) %>% 

    # Coalesce some of the info due to negative control samples and clean up a column
    mutate(case_num = coalesce(case_num, neg_num),
           tissue_num = str_replace_all(.$tissue_num, pattern = "\\.", replace = "")) %>%  

    # Drop the neg_num column
    select(1:4) %>% 

    # Join this result to the RNAseq info
    full_join(., y=tissue_RNAseq.df, by=c("sample" = "sample")) %>% 

    # Join that result to grab viral load information
    right_join(patient_viral_load.df, y=., by=c("case_num" = "case_num"))

# Look at the resulting dataframe
head(tissue_RNAseq.df)
In [ ]:
# How many tissue types do we have?
table(tissue_RNAseq.df$tissue)
In [ ]:
# Generate a matrix version of our data but drop the sample metadata!
tissue_RNAseq.mx <- as.matrix(tissue_RNAseq.df[,...])

head(tissue_RNAseq.mx)
dim(tissue_RNAseq.mx)
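Since later cells treat tissue_RNAseq.df[,1:6] as sample metadata, a reasonable guess for the blank above is negative indexing to drop those six columns. A toy sketch of the idea:

```r
# Toy stand-in: 2 metadata columns followed by 2 numeric gene columns
demo.df <- data.frame(sample = c("a", "b"),
                      tissue = c("lung", "heart"),
                      geneA = c(1, 2), geneB = c(3, 4))

# Drop the metadata columns by negative indexing before converting;
# in the lecture data the metadata occupies columns 1:6, so -(1:6)
demo.mx <- as.matrix(demo.df[, -(1:2)])
demo.mx
```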

4.1.0 t-Distributed Stochastic Neighbour Embedding with the Rtsne package¶

We now have a somewhat more complex dataset. We are still short on actual samples (now observations), but 88 observations and nearly 30K features isn't so bad. A broad question we may wish to ask of such a dataset is whether there is an underlying structure to these samples - for example, do we see grouping based on tissue type, or perhaps even sample preparation?

t-Distributed Stochastic Neighbour Embedding, or t-SNE, is a way for us to project our high-dimension data onto a lower dimension with the aim of preserving local similarities rather than global disparity. When looking at data points, t-SNE will attempt to preserve the local neighbouring structure. As the algorithm pores over the data, it can use different transformations for different regions as it attempts to transform everything to a lower dimension. It is considered "incredibly flexible" at finding local structure where other algorithms may not.

This flexibility is accomplished through 3 steps:

  1. Reduce dimensionality of your data features with PCA!
  2. Generate a probability distribution between all pairs of points, making similar objects highly probable and assigning dissimilar objects a low probability.
  3. Define a similar probability distribution for the samples in a lower dimension while minimizing the divergence between the two distributions based on a distance metric between points in the lower dimension.

We'll discuss more about how this algorithm affects interpretation after seeing the results, but it is considered an exploratory data visualization tool, rather than an explanatory one.

To produce our t-SNE projection we'll use the Rtsne() function from the package of the same name. Some important parameters are:

  • X is our data matrix where each row is an observation
  • dims sets the number of dimensions we'd like to project onto (default is 2).
  • initial_dims sets the number of dimensions that should be retained in the initial PCA step (Default 50).
  • perplexity is a numeric parameter that tunes between local and global aspects of your data.
    • This parameter is a guess as to how many close neighbours a point may have. If you have a sense of sample types (or clusters!) ahead of time, you could try playing with this value (default is 30).
    • According to the algorithm this value should follow this rule: $3 \times perplexity \lt nrow(X) - 1$
  • pca_scale is a boolean to set if the initial PCA step should use scaled data.
  • max_iter is the maximum number of iterations in the algorithm (default is 1000).
In [ ]:
# Rtsne prefers using a matrix for memory issues
# set a seed for reproducible results
set.seed(1981)

# The default perplexity is 30, but with our sample count it can only go as high as 29 before Rtsne complains
# We have just 88 samples, with between 1 and 52 samples per "group"
tissue_tsne <- Rtsne(..., 
                     dims=2, 
                     perplexity=5, 
                     verbose=TRUE, 
                     pca_scale = TRUE,
                     max_iter = 1500)
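The X blank above is presumably the tissue_RNAseq.mx matrix built earlier. A self-contained sketch on random toy data (hypothetical matrix; same arguments as the lecture cell):

```r
library(Rtsne)

# Set a seed for reproducible results
set.seed(1981)

# Toy matrix: 100 observations x 20 features of random values
demo.mx <- matrix(rnorm(100 * 20), nrow = 100)

# Same call shape as above, with the matrix filling the X blank
demo_tsne <- Rtsne(demo.mx,
                   dims = 2,
                   perplexity = 5,
                   verbose = TRUE,
                   pca_scale = TRUE,
                   max_iter = 1500)
```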
In [ ]:
str(...)

4.1.1 Extract information from our tsne object¶

Looking above at the result of our t-SNE analysis, we can notice a few things...

  1. We get back the number of objects we put in: 88.
  2. The element Y is an 88x2 matrix holding the coordinates for our new projection.
  3. There are no eigenvectors or other information about the variables or dimensions or how they were used in the projection!

That's right, there is no underlying information for mapping back to our original dataset. It's a completely black box with no way to reverse engineer the process. That's because the process itself is stochastic! Whereas PCA was a deterministic process - repeatable the same way every time with the same data - that is not the case for t-SNE. That's why even though it can be quite powerful in identifying local similarities, it does not provide a mathematical pathway back to our original data!

Let's extract the information, combine it with our sample information and project it using ggplot2

In [ ]:
# Build our new data frame from the Y values
tissue_tsne.df <- data.frame(x.coord = ...,
                             y.coord = ...)

# Add in our sample information
tissue_tsne.df <- cbind(tissue_RNAseq.df[,1:6], tissue_tsne.df)

# Fix up the information just a little bit to remove NA viral load information
tissue_tsne.df <-
    tissue_tsne.df %>% 
    # replace NAs with DNW (did not work)
    mutate(viral_load = replace_na(viral_load, replace = "DNW"))

head(tissue_tsne.df)
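The coordinate blanks above presumably index the Y element of the Rtsne() result, an n x 2 matrix of projected coordinates. A minimal sketch with a stand-in object:

```r
# Toy stand-in for the list returned by Rtsne(); Y holds the coordinates
tsne_result <- list(Y = matrix(1:6, ncol = 2))

# Pull the two projection axes into a data frame for plotting
demo_tsne.df <- data.frame(x.coord = tsne_result$Y[, 1],
                           y.coord = tsne_result$Y[, 2])
demo_tsne.df
```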
In [ ]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)

combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3"), brewer.pal(8, "Set1"))

# 1. Data
ggplot(data = tissue_tsne.df) +
    # 2. Aesthetics
    aes(x = x.coord, 
        y = y.coord, 
        colour = ...) +

    # Themes
    theme_bw() +
    theme(text = element_text(size=20)) +

    # 3. Scaling
    scale_colour_manual(values = combo.colours) +

    # 4. Geoms
    geom_text(aes(label = case_num), size = 10)
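The colour blank above is likely one of the metadata columns; tissue fits the interpretation in 4.1.2. A toy version of the plot with hypothetical coordinates:

```r
library(ggplot2)

# Toy projection coordinates with tissue labels (hypothetical values)
demo.df <- data.frame(x.coord = c(1, 2, 3),
                      y.coord = c(3, 1, 2),
                      tissue = c("lung", "heart", "liver"),
                      case_num = c("1", "2", "3"))

ggplot(data = demo.df) +
    # colour = tissue is an assumption based on section 4.1.2
    aes(x = x.coord, y = y.coord, colour = tissue) +
    theme_bw() +
    geom_text(aes(label = case_num), size = 10)
```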

4.1.2 Interpreting our t-SNE plot¶

While we don't have a lot of samples, you can still see that we were able to cluster some of our data by tissue type without providing that classification to the algorithm! Great job team!

We can see that we get close clustering of tissues like placenta, heart, and bowel. Our liver samples are kind of everywhere but perhaps using a different perplexity would provide different results.

One interesting thing we can see is that regardless of tissue type, we see some samples are clustering together based on case number - namely case numbers 1, 4 and 5 seem to have some strong underlying expression profiles that connect them across tissue samples. We may also be seeing false relationships so beware!


4.2.0 Uniform Manifold Approximation and Projection¶

This algorithm for projection is in the same flavour as t-SNE projection but has some differences including:

  1. Increased speed and better preservation of the global structure in your data
  2. A different theoretical foundation used to balance between local and global structure

What does that mean for us? Faster results, and more interpretable results! Otherwise the same issues can apply. The setup is slightly easier, with few options to change if you leave the defaults. We can access umap() from the package of the same name. You may also alter the default settings by creating a umap.defaults object. More information on that here

For more tinkering, you can choose to use the uwot package instead where the umap() function has more options that are easily modified.

In [ ]:
# Set our seed
set.seed(1981)

# Generate our projection
tissue_umap <- ...(tissue_RNAseq.mx)
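The blank above is presumably umap() applied to tissue_RNAseq.mx. A self-contained sketch on toy data, including the layout extraction that the later cell performs:

```r
library(umap)

# Set our seed for reproducibility
set.seed(1981)

# Toy matrix: 50 observations x 10 features of random values
demo.mx <- matrix(rnorm(50 * 10), nrow = 50)

# umap() with all defaults, as in the lecture cell
demo_umap <- umap(demo.mx)

# The layout element holds the 2-D projection coordinates
demo_umap.df <- data.frame(x.coord = demo_umap$layout[, 1],
                           y.coord = demo_umap$layout[, 2])
head(demo_umap.df)
```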
In [ ]:
str(tissue_umap)

4.2.1 Extract information from our UMAP object¶

Looking at our UMAP object tissue_umap we see it houses the projection parameters used but also some additional variables:

  1. data: holds our original data matrix.
  2. layout: contains the projection coordinates we need for plotting the data.
  3. knn: a weighted k-nearest neighbour graph. This is a graph that connects each observation to its nearest k neighbours. This generates the first topological representation of the data - like an initial sketch.

You may notice again that there is no data that suggests how we arrived at this solution. There are no eigenvectors or values to reverse the projection!

Let's extract the layout information, combine it with our sample information and project it using ggplot2

In [ ]:
# Re-map our projection points with our tissue data

tissue_umap.df <- data.frame(x.coord = tissue_umap$...[,1],
                             y.coord = tissue_umap$...[,2])

tissue_umap.df <- cbind(tissue_RNAseq.df[,1:6], tissue_umap.df)

tissue_umap.df <-
    tissue_umap.df %>% 
    # replace NAs with DNW (did not work)
    mutate(viral_load = replace_na(viral_load, replace = "DNW"))

head(tissue_umap.df)
In [ ]:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)

combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3"), brewer.pal(8, "Set1"))

# 1. Data
ggplot(data = ...) +
    # 2. Aesthetics
    aes(x = x.coord, 
        y = y.coord, 
        colour = tissue) +

    # Themes
    theme_bw() +
    theme(text = element_text(size=20)) +

    # 3. Scaling
    scale_colour_manual(values = combo.colours) +

    # 4. Geoms
    geom_text(aes(label = case_num), size = 10)

4.2.2 Interpreting our UMAP result¶

So it looks like we retrieved a fairly nice result without much tinkering. We can see now that liver and kidney are more closely grouped as tissues, while heart samples still generally cluster together. Bowel and jejunum appear spatially grouped, and our placenta samples are still close to each other. The clustering we saw with samples 1, 4, and 5 appears to be less pronounced.

There does appear to be some structure between the lung samples in different case numbers so this might be an avenue to explore next to try and see if there truly is a relationship between these groups.


5.0.0 Class summary¶

While t-SNE and UMAP projections do produce clustered data, you have no route back to understanding their relationships. PCA, on the other hand, is strictly a dimension reduction tool. It does not place or assign data points to any groups, BUT it is useful on large datasets prior to clustering!

Today we took a deep dive into principal component analysis. There are of course different variants of this based on the assumptions you can make about your observations and variables like independent component analysis (ICA, non-Gaussian features) and multiple correspondence analysis (MCA, categorical features). Some additional methods can also be used to store the transformation like a PCA does, notably variational autoencoders (VAE).

Overall we should remember that while PCA can have problems in its feature extraction, it is deterministic and repeatable. Also, the final results are provided in such a way that new observations could be transformed and projected onto the same principal components. You can also feed these components back into clustering algorithms like k-means to try and identify specific subgroups.

t-SNE and UMAP, on the other hand, appear to do a much better job with high-dimensional data. They can preserve local structure, and UMAP can also do a fairly good job of preserving global structure. These tools make for great exploratory analysis of your complex datasets. Interpretation of relationships, however, is not mathematically clear as in PCA. These are, after all, projections from a higher dimension for our simpler primate brains!


5.1.0 Weekly assignment¶

This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown

Code (50%)
- Does it follow best practices?
- Does it make good use of available packages?
- Was data prepared properly?

Answers and Output (50%)
- Is output based on the correct dataset?
- Are groupings appropriate?
- Are titles/axes/legends correct?
- Is interpretation of the graphs correct?

Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.

You can save and download the Jupyter notebook in its native format. Submit this file to the appropriate assignment section by 9:59 am on the date of our next class: April 18th, 2023.


5.2.0 Acknowledgements¶

Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


5.3.0 References¶

More information on calculating optimal clusters: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/

Step-by-step how PCA works: https://builtin.com/data-science/step-step-explanation-principal-component-analysis

More PCA explanations here: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

The math of PCA: https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf

t-SNE in R and Python: https://datavizpyr.com/how-to-make-tsne-plot-in-r/

All about UMAP: https://umap-learn.readthedocs.io/en/latest/basic_usage.html


The Center for the Analysis of Genome Evolution and Function (CAGEF)¶

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.